1: INTRODUCTION


1.1 Information 


This course is about the transmission of information from a Sender to a Receiver. In order to 
be transmitted via a particular channel, the information needs to be encoded in the correct form, and 
possibly encrypted if it is to be kept confidential. 


The chief purpose of this course is to study information and how it is transmitted from a sender 
to a receiver. Understanding the amount of information a message contains will tell us how much 
compression there can be when we encode it. It also gives us a measure of how much duplication of 
information there is and hence the extent to which we can correct random errors that arise when the 
message is transmitted. 


An alphabet A is a finite set of symbols. For example we might use the usual alphabet: 
{a,b,c,d,..., x,y,z, “space” } 


(including a space). Often we use a set of digits {0,1,2,3,...,8,9} or binary digits {0,1}. The symbols 
from A are called letters. A message or word is a finite sequence of letters from the alphabet and is 
usually denoted by juxtaposing the letters: 


\[ m = a_1 a_2 a_3 \ldots a_{N-1} a_N . \]


The set of all messages from the alphabet A is denoted by A*. 


In order to convey the message m successfully, the sender needs to transform it into a form suitable 
to transmit. For example, if the message is to be recorded digitally, it will need to be transformed into 
a sequence of binary digits. So the sender encodes the message into a new alphabet B. There is a coding
function $c : A \to B^*$ that codes each letter from A into a finite sequence from B. These finite sequences
are called the code words. The entire message m is then encoded as 


\[ c^*(m) = c(a_1)c(a_2)c(a_3)\ldots c(a_{N-1})c(a_N) \in B^* . \]


This is then transmitted via some channel to the receiver who needs to decode it to recover the original 
message m.


The encoding function c may be chosen so that only the intended receiver can decode it. It is 
then called encryption. However, this is by no means necessary. It may be encoded simply to make 
transmission simpler, or to compress the message to take less space, or to reduce the likelihood of errors 
being introduced in transmission. We will need to consider all of these. 


   SENDER                CHANNEL                RECEIVER
 m  ↦  c*(m)                                  c*(m)  ↦  m
message   coding                              decoding


For example, suppose that we have a message in English, m = “Call at 2pm.”, that is to be sent in an e-mail. It
is first encoded as binary strings using ASCII (American Standard Code for Information Interchange).
So c(C) = 1000011, c(a) = 1100001, ... and c*(m) is


1000011 1100001 1101100 1101100 0100000 1100001 1110100 0100000 0110010 1110000 1101101 0101110

(Here the spaces are not part of the encoded message but simply make the binary string more
readable.) This binary string is then transmitted via the internet and decoded by the receiver.
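
The ASCII example can be reproduced with a few lines of Python. This is only a sketch: the function names `encode_ascii7` and `decode_ascii7` are invented for illustration, and each letter is assumed to be coded by its 7-bit ASCII value.

```python
def encode_ascii7(message: str) -> list[str]:
    """Return the 7-bit ASCII code word c(a) for each letter a of the message."""
    words = []
    for letter in message:
        code_point = ord(letter)
        if code_point > 127:
            raise ValueError(f"{letter!r} is not an ASCII character")
        words.append(format(code_point, "07b"))  # zero-padded binary, 7 digits
    return words


def decode_ascii7(bits: str) -> str:
    """Invert c* by cutting the bit string into blocks of 7 bits."""
    if len(bits) % 7 != 0:
        raise ValueError("length is not a multiple of 7")
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))


code_words = encode_ascii7("Call at 2pm.")
transmitted = "".join(code_words)  # c*(m), without the readability spaces
```

Because every code word has the same length, the receiver can decode simply by cutting the received string into blocks of 7 bits.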


Other examples of coding are: 

* Morse code
* Bar codes
* Postcodes
* Audio CDs (FLAC: error correcting; MP3: compression)
* Mobile ’phone text messages (compression)
* Compressing computer files: zip, gz (compression)
* Bank Card PINs (cryptography)
* Digital photographs: jpeg (compression with data loss)


1.2 Requirements 


We will need to use a small amount from a number of earlier courses. In particular it would be 
helpful if you reviewed: 


PROBABILITY               Finite probability spaces, expectation, the weak law of large numbers.
NUMBERS AND SETS          Modular arithmetic, Euclid’s algorithm.
LINEAR ALGEBRA            Vector spaces.
GROUPS, RINGS & MODULES   Finite fields $\mathbb{F}_q$.


In addition, various parts of the course are linked to other courses, notably: Markov Chains; 
Statistics; Probability and Measure; Linear Analysis. 


1.3 Recommended Books


G.M. Goldie & R.G.E. Pinch, Communication Theory, Cambridge University Press, 1991.

D. Welsh, Codes and Cryptography, Oxford University Press, 1988.

T.M. Cover & J.A. Thomas, Elements of Information Theory, Wiley, 1991.

W. Trappe & L.C. Washington, Introduction to Cryptography with Coding Theory, Prentice Hall, 2002.

J.A. Buchmann, Introduction to Cryptography (2nd Ed.), Springer UTM, 2004.


1.4 Example sheets 


There will be four example sheets. The first will cover the first two weeks’ material.


Lecture 1


2: ENTROPY 


Let P be a (discrete) probability on a sample space $\Omega$. For each subset A of $\Omega$, the information
of A is
\[ I(A) = -\log_2 P(A) . \]
Here $\log_2$ is the logarithm to base 2, so $\log_2 p = \ln p / \ln 2$. It is clear that $I(A) \ge 0$ with equality if and
only if $P(A) = 1$.


Example: 
Let $(X_n)$ be independent Bernoulli random variables that each take the values 0 and 1 with probability
$\tfrac12$. For any sequence $x_1 x_2 \ldots x_N$ of N values, the set $A = \{X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N\}$ has
\[ P(A) = \left(\tfrac12\right)^N \quad \text{and so} \quad I(A) = N . \]


Hence the information I(A) measures the number of binary digits (bits) that are specified. Therefore
the units for the information are “bits”.


Exercise: 
Show that two sets $A, B \subseteq \Omega$ are independent if and only if

\[ I(A \cap B) = I(A) + I(B) . \]


Let X be a discrete random variable on $\Omega$. For each value x the set {X = x} has information
\[ I(X = x) = -\log_2 P(X = x) . \]
The expected value of this is the entropy of X:

\[ H(X) = -\sum_x P(X = x) \log_2 P(X = x) . \]

This is the average amount of information conveyed by the random variable X. Note that the entropy
H(X) does not depend on the particular values taken by X but only on the probability distribution of
X, that is on the values P(X = x).


This definition fails when one of the sets {X = x} has probability 0. In this case, note that
$h(p) = -p \log_2 p$ tends to 0 as $p \searrow 0$. So we should interpret the term $-P(X = x) \log_2 P(X = x)$ as 0
when P(X = x) = 0. The entropy satisfies $H(X) \ge 0$, with equality if and only if X is almost surely
constant.
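
As a small numerical illustration of these definitions, the following sketch computes the information of an event and the entropy of a distribution, using the convention that terms with probability 0 contribute 0. The function names are chosen here for illustration only.

```python
import math


def information(p: float) -> float:
    """Information I(A) = -log2 P(A), in bits, of an event with probability p > 0."""
    return -math.log2(p)


def entropy(probabilities) -> float:
    """Entropy H(X) in bits of a distribution given as a sequence of probabilities.

    Terms with probability 0 are skipped, matching the convention that
    -p log2 p is interpreted as 0 when p = 0.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


five_fair_bits = information(0.5 ** 5)   # a specified sequence of 5 fair bits
fair_coin = entropy([0.5, 0.5])          # one fair bit
constant = entropy([1.0, 0.0])           # an almost surely constant variable
```

The three values are 5, 1 and 0 bits respectively, in line with the example and the remark about almost surely constant random variables.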


[Figure: graph of $-p \log_2 p$ against p for $0 \le p \le 1$, with p = 1/e marked on the horizontal axis.]

Example:
Let X be a random variable and consider a function f(X) of X. Then $H(f(X)) \le H(X)$.


For each value x taken by X set y = f(x), then
\[ P(f(X) = y) = \sum_{x' : f(x') = y} P(X = x') \ge P(X = x) . \]

So we have
\begin{align*}
H(f(X)) &= -\sum_y P(f(X) = y) \log_2 P(f(X) = y) \\
        &= -\sum_x P(X = x) \log_2 P(f(X) = f(x)) \\
        &\le -\sum_x P(X = x) \log_2 P(X = x) = H(X) .
\end{align*}
Example: 


Let $X : \Omega \to \{0,1\}$ be a Bernoulli random variable that takes the value 1 with probability p and 0 with
probability 1 - p. Then

\[ H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p) . \]
The graph of this as a function of p is concave:


[Figure: graph of the entropy H(p, 1 - p) against p for $0 \le p \le 1$.]


The entropy of X is maximal when $p = \tfrac12$, when it is 1.


The entropy is the average amount of information that we gain by knowing the value of X. We 
will see later that it is (to within 1) the expected number of yes/no questions we need to ask in order to 
establish the value of X. We will use the entropy to measure the amount of information in a message 
and hence to determine how much it can be compressed. 


Lemma 2.1 Gibbs’ inequality 

Let X be a discrete random variable that takes different values with probabilities $(p_k)_{k=1}^K$, so $\sum p_k = 1$.
If $(q_k)_{k=1}^K$ is another set of positive numbers with $\sum q_k = 1$ then

\[ -\sum_{k=1}^K p_k \log_2 p_k \le -\sum_{k=1}^K p_k \log_2 q_k . \]

There is equality if and only if $q_k = p_k$ for each k.
Proof:
The inequality is equivalent to

\[ \sum_k p_k \ln\left(\frac{q_k}{p_k}\right) \le 0 . \]

Now the natural logarithm function is concave, so its graph lies below the tangent line at the point
(1,0), that is
\[ \ln t \le t - 1 . \]

Indeed, the natural logarithm is strictly concave since its second derivative is strictly negative. Hence
$\ln t < t - 1$ unless t = 1. Therefore,

\[ \sum_k p_k \ln\left(\frac{q_k}{p_k}\right) \le \sum_k p_k \left(\frac{q_k}{p_k} - 1\right) = \sum_k (q_k - p_k) = 0 \]

with equality if and only if $q_k / p_k = 1$ for each k.
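
Gibbs’ inequality can also be checked numerically. The sketch below draws random pairs of distributions and compares the two sides; it is a sanity check under floating-point arithmetic, not a proof, and the helper names and the tolerance `1e-12` are assumptions made for this illustration.

```python
import math
import random


def cross_term(p, q):
    """Compute -sum_k p_k log2 q_k, the right-hand side of Gibbs' inequality."""
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)


def random_distribution(K, rng):
    """Return K strictly positive numbers that sum to 1."""
    weights = [rng.random() + 1e-9 for _ in range(K)]
    total = sum(weights)
    return [w / total for w in weights]


def gibbs_holds(trials=1000, K=5, seed=0):
    """Check -sum p log2 p <= -sum p log2 q on randomly drawn p and q."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(K, rng)
        q = random_distribution(K, rng)
        if cross_term(p, p) > cross_term(p, q) + 1e-12:
            return False
    return True
```

Taking q = p gives equality, since both sides are then the entropy of p.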


Example: 
A random variable Y that takes K distinct values with equal probability 1/K has entropy $\log_2 K$.


If X is any random variable that takes K different values then its entropy is at most $\log_2 K$;
this follows from Gibbs’ inequality with $q_k = 1/K$ for each k.
Thus the entropy is maximal when the values taken by X are all equally likely.


(You should compare this with the results in thermodynamics that entropy is maximal when states are 
all equally likely.) 


Let X,Y be two random variables, then (X,Y) is also a (vector-valued) random variable with 
entropy H(X,Y). This is called the joint entropy of X and Y. Gibbs’ inequality (2.1) enables us to 
prove: 


Corollary 2.2 
Let X,Y be discrete random variables. Then the joint entropy satisfies 


\[ H(X,Y) \le H(X) + H(Y) \]
with equality if and only if X and Y are independent. 


Proof:
The joint entropy is

\[ H(X,Y) = -\sum_{x,y} P(X = x, Y = y) \log_2 P(X = x, Y = y) . \]

The products P(X = x)P(Y = y) sum to 1 and so Gibbs’ inequality gives

\begin{align*}
H(X,Y) &= -\sum_{x,y} P(X = x, Y = y) \log_2 P(X = x, Y = y) \\
       &\le -\sum_{x,y} P(X = x, Y = y) \log_2 \bigl( P(X = x) P(Y = y) \bigr) \\
       &= -\sum_{x,y} P(X = x, Y = y) \bigl( \log_2 P(X = x) + \log_2 P(Y = y) \bigr) \\
       &= -\sum_x P(X = x) \log_2 P(X = x) - \sum_y P(Y = y) \log_2 P(Y = y) \\
       &= H(X) + H(Y)
\end{align*}

with equality if and only if

\[ P(X = x, Y = y) = P(X = x) P(Y = y) \quad \text{for all } x, y . \]


This is the condition that X and Y are independent random variables. 


Lecture 2


Example: 

Let X be a random variable that takes D values each with probability 1/D. Then X has entropy $\log_2 D$.
Now suppose that $X_1, X_2, \ldots, X_N$ are independent random variables each with the same distribution
as X. Then the entropy of the sequence $(X_1, X_2, \ldots, X_N)$ is $N \log_2 D$.


We can refine the last corollary by considering conditional distributions. If x is a value taken by
the random variable X with non-zero probability, then the conditional distribution of another random
variable Y given X = x is

\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} . \]

This gives a random variable (Y|X = x). Now observe that


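\begin{align*}
H(X,Y) &= -\sum_{x,y} P(X = x, Y = y) \log_2 P(X = x, Y = y) \\
H(X)   &= -\sum_{x,y} P(X = x, Y = y) \log_2 P(X = x)
\end{align*}

so

\begin{align*}
H(X,Y) - H(X) &= -\sum_{x,y} P(X = x, Y = y) \log_2 \frac{P(X = x, Y = y)}{P(X = x)} \\
              &= -\sum_{x,y} P(X = x) P(Y = y \mid X = x) \log_2 P(Y = y \mid X = x) \\
              &= \sum_x P(X = x) H(Y \mid X = x) .
\end{align*}

This is called the conditional entropy H(Y|X), so

\[ H(Y \mid X) = H(X,Y) - H(X) . \]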


Now observe that, for each value x, we have
\[ H(Y \mid X = x) \ge 0 \]
with equality if and only if (Y|X = x) is almost surely constant. Therefore,
\[ H(X,Y) - H(X) = H(Y \mid X) \ge 0 \]
with equality if and only if Y is almost surely a function of X. These results show that

\[ H(X) + H(Y) \ge H(X,Y) \ge H(X) . \]
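
The chain of inequalities can be checked on a small joint distribution. In the sketch below the joint distribution `joint` is an arbitrary illustrative choice, not taken from the notes, and the helper names are invented here.

```python
import math
from collections import defaultdict


def entropy(distribution) -> float:
    """Entropy in bits of a distribution given as a dict value -> probability."""
    return sum(-p * math.log2(p) for p in distribution.values() if p > 0)


def marginals(joint):
    """Return the marginal distributions of X and Y from a dict (x, y) -> probability."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)


# Hypothetical joint distribution of (X, Y), chosen only for illustration.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
px, py = marginals(joint)

h_xy = entropy(joint)          # joint entropy H(X, Y)
h_x, h_y = entropy(px), entropy(py)
h_y_given_x = h_xy - h_x       # conditional entropy H(Y|X)

# For comparison: the independent joint distribution with the same marginals.
independent = {(x, y): px[x] * py[y] for x in px for y in py}
h_independent = entropy(independent)
```

For this choice X and Y are not independent, so H(X,Y) is strictly smaller than H(X) + H(Y), while the independent distribution with the same marginals attains equality up to rounding.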
3: CODING


3.1 Prefix-free Codes 


Suppose that we encode a message $m = a_1 a_2 a_3 \ldots a_N$ using the function $c : A \to B^*$. Each letter
a in the alphabet A is replaced by the code word c(a), which is a finite sequence of letters from the
alphabet B. The entire message m is coded as


\[ c^*(m) = c(a_1)c(a_2)c(a_3)\ldots c(a_N) . \]


Here c* is the induced map $c^* : A^* \to B^*$. To be useful, it must be possible to decode c*(m) and recover
the original message m. In this case we say that c is decodable (or decipherable). This means that c*
must be injective.


In order for c to be decodable, the coding function $c : A \to B^*$ must certainly be injective. However,
this is not sufficient. For example, the code c that sends the nth letter to a string of n 0s is injective
but c* is not injective since every coded message is a string of 0s; for instance, the messages $a_1 a_1$
and $a_2$ are both coded as 00.


Let w be a word in B*, that is a finite sequence of letters from B. Another word w' is a prefix of
w if w is the concatenation of w' with another word w'', so w = w'w''. A code $c : A \to B^*$ is prefix-free
if no code word c(a) is a prefix of a different code word c(a').


Example: 
The code 
\[ c : \{0,1,2,3\} \to \{0,1\}^* \]
which maps

c : 0 ↦ 0
c : 1 ↦ 10
c : 2 ↦ 110
c : 3 ↦ 111

is prefix-free. However, if we reversed each of the code words we would not have a prefix-free code but
it would still be decodable.


Any prefix-free code is decodable. For, if we receive a message w ∈ B* we can decode it by looking
at the prefixes of w. One of these must be $c(a_1)$ for some letter $a_1 \in A$. Since the code is prefix-free,
this letter $a_1$ is unique. Now delete this prefix and repeat the process to find $a_2$ etc. Not only can we
decode the message but we can decode it as it is received: we do not need to wait for the end of the
message to decode the first letter. As soon as we have received all of $c(a_n)$ we can tell what $a_n$ was. For
this reason, prefix-free codes are sometimes called instantaneous codes or self-punctuating codes. They
are especially convenient and we will almost always use them.
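
The decoding procedure just described can be sketched as follows, using the example code 0 ↦ 0, 1 ↦ 10, 2 ↦ 110, 3 ↦ 111. The function names are illustrative, and the decoder assumes that the code passed to it really is prefix-free.

```python
CODE = {0: "0", 1: "10", 2: "110", 3: "111"}


def is_prefix_free(code) -> bool:
    """True if no code word is a prefix of the code word of another letter."""
    words = list(code.values())
    return not any(
        i != j and w.startswith(v)
        for i, v in enumerate(words)
        for j, w in enumerate(words)
    )


def encode(message, code) -> str:
    """The induced map c*: concatenate the code words of the letters."""
    return "".join(code[a] for a in message)


def decode(bits: str, code) -> list:
    """Decode letter by letter, emitting each letter as soon as its code word is complete."""
    inverse = {word: letter for letter, word in code.items()}
    message, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:      # a complete code word has been received
            message.append(inverse[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a code word")
    return message
```

For example, `decode(encode([2, 0, 3, 1, 0], CODE), CODE)` recovers the original message, and each letter is available as soon as the last bit of its code word arrives.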


We can also think about prefix-free codes in terms of trees. For the alphabet B draw a tree with a
base vertex at the empty string ∅ and vertices labelled by words in B*. Each vertex w is joined to the
vertices wb for each letter b ∈ B. The prefixes of a word w are then all of the vertices of the tree on the
path from the base vertex ∅ to w. So, a code is prefix-free if the path from any code word to the base
vertex contains no other code word. This is a useful way to think about prefix-free codes.


The vertices at level N in this tree are those N steps from the base vertex, so those labelled by 
words of length N. 
Example:
The tree for the example code above is:


[Figure: tree for the example code, drawn with levels 0 to 3.]


3.2 Kraft’s Inequality 


Proposition 3.1 Kraft’s Inequality I 
Let $c : A \to B^*$ be a prefix-free code with the code word c(a) having length l(a). Then


\[ \sum_{a \in A} D^{-l(a)} \le 1 \tag{$*$} \]

where D = |B| is the number of letters in the alphabet B.

Proof: 


Let L = max{l(a) : a ∈ A}. Consider the tree constructed from the prefix-free code c. The
vertices at level L are those corresponding to words of length L, so there are $D^L$ such vertices.


A code word c(a) with length l(a) is a prefix of $D^{L - l(a)}$ such vertices. Moreover, since c is
prefix-free, no level L vertex can have two code words as a prefix. Hence, the total number of level L vertices
with code words as prefixes is
\[ \sum_{a \in A} D^{L - l(a)} \]
and this can be at most the total number of level L vertices: $D^L$. So

\[ \sum_{a \in A} D^{-l(a)} \le 1 \]

as required.


The converse to this is also true. 


Proposition 3.2 Kraft’s Inequality II
Let A, B be finite alphabets with B having D letters. If l(a) are positive integers for each letter a ∈ A,
satisfying
\[ \sum_{a \in A} D^{-l(a)} \le 1 \]
then there is a prefix-free code $c : A \to B^*$ with each code word c(a) having length l(a).


Proof: 
First re-order the letters in A as $a_1, a_2, \ldots, a_K$ so that the lengths $l_j = l(a_j)$ satisfy

\[ l_1 \le l_2 \le l_3 \le \ldots \le l_K . \]
We will define the code c inductively.
Choose any word of length $l_1$ as $c(a_1)$.


Suppose that $c(a_j)$ has been chosen for j < k so that no $c(a_i)$ is a prefix of $c(a_j)$ for i < j < k.
Consider the words of length $l_k$. There are $D^{l_k}$ of these. Of these, $D^{l_k - l_j}$ have $c(a_j)$ as a prefix.
However, the given inequality shows that

\[ 1 + \sum_{j=1}^{k-1} D^{l_k - l_j} = \sum_{j=1}^{k} D^{l_k - l_j} \le D^{l_k} . \]

So we can choose a word $c(a_k)$ of length $l_k$ which does not have any $c(a_j)$ (for j < k) as a prefix.


By induction, we can construct the required prefix-free code. 
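
The inductive construction can be turned into a short program. The sketch below makes one particular choice at each step, namely the first available word in numerical order, so it produces just one of the many possible codes. The function names are invented for illustration and letters of B are written as the digits 0, ..., D-1, which assumes D ≤ 10.

```python
from fractions import Fraction


def kraft_sum(lengths, D=2) -> Fraction:
    """Exact value of the sum of D^(-l) over the given code word lengths."""
    return sum(Fraction(1, D ** l) for l in lengths)


def to_base(n: int, D: int, width: int) -> str:
    """Write n in base D as a string of exactly `width` digits."""
    digits = []
    for _ in range(width):
        n, r = divmod(n, D)
        digits.append(str(r))
    return "".join(reversed(digits))


def code_from_lengths(lengths, D=2):
    """Build a prefix-free code with the given word lengths, if Kraft's inequality holds."""
    if kraft_sum(lengths, D) > 1:
        raise ValueError("lengths violate Kraft's inequality")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    words = [None] * len(lengths)
    value, previous = 0, 0   # `value` is the next unused word, read as a base-D number
    for i in order:
        value *= D ** (lengths[i] - previous)   # extend to the new length
        words[i] = to_base(value, D, lengths[i])
        value += 1                              # skip every word with this prefix
        previous = lengths[i]
    return words
```

With lengths 1, 2, 3, 3 and D = 2 this reproduces the code words 0, 10, 110, 111 of the earlier example.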


Note that the code constructed above is far from unique. 


Kraft’s inequality was strengthened by McMillan who showed that it holds for any decodable code, 
even if it is not prefix-free. This means that if we are concerned only with the lengths of the code words 
we need only ever consider prefix-free codes. 


Proposition 3.3 McMillan
Let $c : A \to B^*$ be a decodable code with the code word c(a) having length l(a). Then


\[ \sum_{a \in A} D^{-l(a)} \le 1 \tag{$*$} \]

where D = |B| is the number of letters in the alphabet B.
Proof:
Let L be the maximum length of code word c(a). Consider the power

\[ \Bigl( \sum_{a \in A} D^{-l(a)} \Bigr)^R \]

for a natural number R. If we expand the bracket we obtain an expression of the form:

\[ \sum D^{-(l(a_1) + l(a_2) + \ldots + l(a_R))} \]

where the sum is over all sequences $a_1, a_2, a_3, \ldots, a_R$ of length R. Now the word
\[ w = c(a_1)c(a_2)c(a_3)\ldots c(a_R) \in B^* \]
has length
\[ |w| = l(a_1) + l(a_2) + \ldots + l(a_R) \le RL . \]

Since c is decodable, every word w ∈ B* with length |w| = m can come from at most one sequence
$a_1 a_2 \ldots a_R$. Hence,

\[ \Bigl( \sum_{a \in A} D^{-l(a)} \Bigr)^R \le \sum_{m=1}^{RL} n(m) D^{-m} \]

where n(m) is the number of words w of length m that are of the form $c(a_1)c(a_2)c(a_3)\ldots c(a_R)$. The
total number of words of length m is at most $D^m$, so $n(m) \le D^m$. Therefore

\[ \Bigl( \sum_{a \in A} D^{-l(a)} \Bigr)^R \le \sum_{m=1}^{RL} D^m D^{-m} = RL . \]

Taking Rth roots gives
\[ \sum_{a \in A} D^{-l(a)} \le (RL)^{1/R} \]

and taking the limit as R → ∞ gives the desired result, since $(RL)^{1/R} \to 1$.
4: EFFICIENT CODES


There is often a cost for each letter transmitted through a channel, either a financial cost or a cost 
in terms of space or time required. So we try to find codes where the expected length of code words is 
as small as possible. We will call such a code optimal. 


4.1 Shannon-Fano Codes


Let $c : A \to B^*$ be a code. We will assume that this code is decodable and, indeed, prefix-free. For
each letter a ∈ A, the code word c(a) has length |c(a)|. We wish to find codes where the expected value
of this length is as small as it can be. To find an expected value we need to fix a probability distribution
on A. Let p(a) be the probability of choosing the letter a ∈ A. Then the expected length of a code word
is

\[ \sum_{a \in A} p(a) |c(a)| . \]

It is convenient to let A be a random variable that takes values in the alphabet A with P(A = a) = p(a). Then
the expected length of a code word is

\[ E|c(A)| . \]


Kraft’s inequality (3.1) shows that the lengths of the code words l(a) = |c(a)| must satisfy

\[ \sum_{a \in A} D^{-l(a)} \le 1 . \tag{$*$} \]

Moreover, Proposition 3.2 shows that if we have any positive integers l(a) which satisfy ($*$), then there
is a prefix-free code with code word lengths l(a). Hence we need to solve the optimisation problem:

Minimize $\sum p(a) l(a)$ subject to $\sum D^{-l(a)} \le 1$ and each l(a) an integer.

It is informative to first solve the simpler optimisation problem where we do not insist that the
values l(a) are all integers.


Example: 
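Show that the minimum value of $\sum p(a) l(a)$ subject to $\sum D^{-l(a)} \le 1$ occurs when

\[ l(a) = -\log_D p(a) = -\frac{\log_2 p(a)}{\log_2 D} . \]

If we define l(a) by this formula then we certainly have

\[ \sum D^{-l(a)} = \sum p(a) = 1 \]
and the expected length of code words becomes

\[ -\sum p(a) \frac{\log_2 p(a)}{\log_2 D} = \frac{H(A)}{\log_2 D} . \]

We need to show that for any values l(a) that satisfy $\sum D^{-l(a)} \le 1$ we have

\[ \sum p(a) l(a) \ge \frac{-\sum p(a) \log_2 p(a)}{\log_2 D} = \frac{H(A)}{\log_2 D} . \]

Set $S = \sum D^{-l(a)}$ and $q(a) = D^{-l(a)}/S$. Then $\sum q(a) = 1$ and Gibbs’ inequality (2.1) shows that
\begin{align*}
-\sum p(a) \log_2 p(a) &\le -\sum p(a) \log_2 q(a) = \sum p(a) \bigl( l(a) \log_2 D + \log_2 S \bigr) \\
  &= \Bigl( \sum p(a) l(a) \Bigr) \log_2 D + \log_2 S \le \Bigl( \sum p(a) l(a) \Bigr) \log_2 D
\end{align*}

as required, since $S \le 1$ gives $\log_2 S \le 0$.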
Proposition 4.1
Let A be a random variable that takes values in a finite alphabet A. For any decodable code $c : A \to B^*$
the expected length of code words satisfies

\[ E|c(A)| \ge \frac{H(A)}{\log_2 D} \]

where D is the number of letters in B.


Proof:
Kraft’s inequality (3.1), extended to decodable codes by McMillan (3.3), shows that the lengths
l(a) = |c(a)| must satisfy $\sum D^{-l(a)} \le 1$. Hence,
we see as in the example above, that


\[ \sum p(a) l(a) \ge \frac{H(A)}{\log_2 D} \]


as required.


We can not always find a code that achieves equality in Proposition 4.1. However, a very simple 
argument shows that we can get close. 


Proposition 4.2 Shannon-Fano encoding
Let A be a random variable that takes values in a finite alphabet A. There is a prefix-free code
$c : A \to B^*$ for which the expected length of code words satisfies

\[ E|c(A)| < 1 + \frac{H(A)}{\log_2 D} \]

where D is the number of letters in B.


Proof:
Kraft’s inequality shows that we only need to find integers l(a) that satisfy
\[ \sum p(a) l(a) < 1 + \frac{H(A)}{\log_2 D} \quad \text{and} \quad \sum D^{-l(a)} \le 1 . \]
If the lengths l(a) are to be integers, then we can not always take $l(a) = -\log_D p(a)$. However, we can
set

\[ l(a) = \lceil -\log_D p(a) \rceil . \]

For this choice of l(a) we have
\[ -\log_D p(a) \le l(a) \quad \text{so} \quad p(a) \ge D^{-l(a)} . \]

This certainly implies that $\sum D^{-l(a)} \le 1$, so there is a prefix-free code with these word lengths.
Moreover,

\[ \sum p(a) l(a) < \sum p(a) \bigl( 1 - \log_D p(a) \bigr) = 1 - \sum p(a) \log_D p(a) = 1 + \frac{H(A)}{\log_2 D} . \]

Given the word lengths l(a), we can now use Proposition 3.2 to construct the desired code c.
The code produced in this way is called a Shannon-Fano code. It is not quite optimal but is easy
to construct and has expected word lengths very close to the optimal value.


Example: 
The elements of the alphabet A = {0,1,2,3} are taken with probabilities 0.4, 0.3, 0.2, 0.1. So we find:

a    p(a)    $-\log_2 p(a)$    $\lceil -\log_2 p(a) \rceil$
0    0.4     1.32              2
1    0.3     1.74              2
2    0.2     2.32              3
3    0.1     3.32              4


Hence we see that a Shannon-Fano code is


c : 0 ↦ 00
c : 1 ↦ 01
c : 2 ↦ 100
c : 3 ↦ 1100


This code has E|c(A)| = 2.4 while the entropy is H(A) = 1.85. So we have, as expected,


\[ H(A) \le E|c(A)| < H(A) + 1 . \]


This code does not achieve the smallest possible value for E|c(A)|. For the code


h : 0 ↦ 0
h : 1 ↦ 10
h : 2 ↦ 110
h : 3 ↦ 111


has E|h(A)| = 1.9. (This is indeed the optimal value and h is a Huffman code.)
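
The numbers in this example can be recomputed with the sketch below, which takes D = 2. The function names are illustrative; the ceiling is taken of a floating-point logarithm, so probabilities that are exact powers of 1/2 may need extra care with rounding.

```python
import math


def entropy(probs) -> float:
    """Entropy H(A) in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def shannon_fano_lengths(probs):
    """Word lengths l(a) = ceil(-log2 p(a)) for a binary Shannon-Fano code."""
    return [math.ceil(-math.log2(p)) for p in probs]


def expected_length(probs, lengths) -> float:
    """Expected code word length: the sum of p(a) l(a)."""
    return sum(p * l for p, l in zip(probs, lengths))


probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_fano_lengths(probs)                 # word lengths of c
shannon_fano = expected_length(probs, lengths)        # E|c(A)|
huffman = expected_length(probs, [1, 2, 3, 3])        # E|h(A)| for the code h above
h_a = entropy(probs)
```

The computed lengths are 2, 2, 3, 4, giving E|c(A)| = 2.4, against 1.9 for the code h and an entropy of about 1.85.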


Propositions 4.1 and 4.2 together give:


Theorem 4.3 Shannon’s noiseless coding theorem 
Let A be a random variable that takes values in a finite alphabet A. An optimal code $c : A \to B^*$ for
an alphabet B of size D satisfies


\[ \frac{H(A)}{\log_2 D} \le E|c(A)| < \frac{H(A)}{\log_2 D} + 1 . \]


Example: 
We can ask any questions with a yes or no answer about the value taken by the random variable A. 
What is the average number of questions we need to ask to determine the value of A? 


Each question gives a map


\[ A \to \{0,1\} ; \quad a \mapsto \begin{cases} 1 & \text{if the answer is yes for } a; \\ 0 & \text{if the answer is no for } a. \end{cases} \]

The sequence of these questions thus gives a code $c : A \to \{0,1\}^*$. The sequence of answers determines
the value of a if and only if this code is decodable.


So Shannon’s noiseless coding theorem shows that it is possible to choose the questions with


\[ H(A) \le E|c(A)| < H(A) + 1 . \]


Thus the average number of questions required is within 1 of the entropy H(A). 
4.2 Huffman Codes


We need to work a little harder to obtain a code that is actually optimal. For simplicity we will 
only do this in the case where B is the binary alphabet {0,1}. Huffman gave an inductive definition of 
codes that we can prove are optimal. When Huffman was a graduate student his teacher, Fano, offered 
students a chance to avoid the examination at the end of the course by finding an algorithm for optimal 
codes. Huffman duly did this. 


Let A be a finite alphabet with probabilities p(a). We will label the elements of A as $a_1, a_2, \ldots, a_K$
so that the probabilities $p_k = p(a_k)$ satisfy


\[ p_1 \ge p_2 \ge p_3 \ge \ldots \ge p_K . \]


When K = 2 the Huffman code is simply
\[ h : \{a_1, a_2\} \to \{0,1\} ; \quad h(a_1) = 0 , \; h(a_2) = 1 . \]


This is clearly optimal. Suppose that a Huffman code has been defined for alphabets of size K - 1.
Choose the two letters in A that have the smallest probabilities: $p_{K-1}$ and $p_K$. Form a new alphabet
$\tilde{A} = \{a_1, a_2, \ldots, a_{K-2}, a_{K-1} \cup a_K\}$ by combining the letters $a_{K-1}$ and $a_K$ to give a new letter $a_{K-1} \cup a_K$
in $\tilde{A}$ that is given the probability $p_{K-1} + p_K$. Let $\tilde{h} : \tilde{A} \to \{0,1\}^*$ be a Huffman code for this new
alphabet. Then the Huffman code for A is obtained by adding a 0 or a 1 to the end of the code word
$\tilde{h}(a_{K-1} \cup a_K)$ to give the code words for $a_{K-1}$ and $a_K$:

\[ h(a_j) = \begin{cases}
\tilde{h}(a_j) & \text{for } j = 1, 2, \ldots, K-2; \\
\tilde{h}(a_{K-1} \cup a_K)0 & \text{for } j = K-1; \\
\tilde{h}(a_{K-1} \cup a_K)1 & \text{for } j = K.
\end{cases} \]
This defines the Huffman code. It is clearly prefix-free.


Example: 

Consider the alphabet A = {0,1,2,3,4} with probabilities 0.4, 0.2, 0.15, 0.15, 0.1. The letters with the
smallest probabilities are 3 and 4, so combine these to form 3 ∪ 4. This gives {0, 1, 2, 3 ∪ 4} with
probabilities 0.4, 0.2, 0.15, 0.25.


The letters with the smallest probabilities are now 1 and 2, so combine these to form 1 ∪ 2. This
gives {0, 1 ∪ 2, 3 ∪ 4} with probabilities 0.4, 0.35, 0.25.


The letters with the smallest probabilities are now 1 ∪ 2 and 3 ∪ 4, so combine these to form (1 ∪ 2) ∪ (3 ∪ 4)
with probability 0.6.


Hence we see that a Huffman code is:


h : 0 ↦ 0
h : 1 ↦ 100
h : 2 ↦ 101
h : 3 ↦ 110
h : 4 ↦ 111


This has expected code length E|h(A)| = 0.4 + 3 × 0.6 = 2.2 compared with an entropy H(A) = 2.15.
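
The recursive definition can be carried out mechanically by repeatedly merging the two least likely letters. The sketch below uses a priority queue from Python's standard `heapq` module; the function name `huffman_code` is invented here. Ties between equal probabilities are broken arbitrarily, so the code words may differ from those in the example even though the word lengths agree.

```python
import heapq
from itertools import count


def huffman_code(probs):
    """Return a binary Huffman code as a dict letter -> code word.

    `probs` is a dict letter -> probability with at least two letters.
    """
    tiebreak = count()  # unique counter so that heap entries never compare the dicts
    heap = [(p, next(tiebreak), {a: ""}) for a, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)   # least likely (combined) letter
        p2, _, code2 = heapq.heappop(heap)   # second least likely
        # Prefixing here, while merging from the bottom up, is equivalent to
        # appending the final bit in the top-down definition.
        merged = {a: "0" + w for a, w in code2.items()}
        merged.update({a: "1" + w for a, w in code1.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]


probs = {0: 0.4, 1: 0.2, 2: 0.15, 3: 0.15, 4: 0.1}
code = huffman_code(probs)
expected = sum(probs[a] * len(code[a]) for a in probs)
```

For the probabilities of the example this yields one code word of length 1 and four of length 3, so the expected length is again 2.2.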


What is the Shannon-Fano code? Is it optimal?


We will now prove that Huffman codes are optimal. 


Theorem 4.4 Huffman codes are optimal 
A Huffman code is optimal. That is, its average length of code words is as small as possible for any 
decodable code from the alphabet A into {0, 1}*. 
As above, we label the letters in A as $a_1, a_2, \ldots, a_K$ with probabilities $p_1 \ge p_2 \ge p_3 \ge \ldots \ge p_K > 0$.
The random variable A takes the value $a_j$ with probability $p_j$. So the average code length is E|c(A)|.


Lemma 4.5 Optimal codes 
There is an optimal code $c : A \to \{0,1\}^*$ that satisfies:


(a) The lengths $l_j = |c(a_j)|$ are ordered inversely to the probabilities, so $l_1 \le l_2 \le \ldots \le l_K$.


(b) The code words $c(a_{K-1})$ and $c(a_K)$ have the same length and differ only in the last bit.


Proof: 

We know that there is some prefix-free code for A. Let its expected code length be C. Then
there are only a finite number of codes c with $\sum p_k l_k \le C$. So there must be a prefix-free code that
achieves the minimum value for $\sum p_k l_k$. This is an optimal code.


(a) If $p_i > p_j$ but $l_i > l_j$, then we could reduce $\sum p_k l_k$ by interchanging the code words $c(a_i)$ and
$c(a_j)$. Hence we must have $l_1 \le l_2 \le l_3 \le \ldots \le l_K$.


(b) Let L be the maximum code word length, so $L = l_K$. Write the code word $c(a_K)$ as wb where w
is a word of length L - 1 and b is the final bit. Consider the new code we obtain by changing the
single code word $c(a_K)$ to w. This reduces the length and so reduces $\sum p_j l_j$. Since c was optimal,
this new code can not be prefix-free. Hence there must be another code word $c(a_j)$ that has w as
a prefix. This means that $c(a_j)$ and $c(a_K)$ are both of length L and differ only in their last bit.


Permute the code words of length L so that $c(a_{K-1})$ and $c(a_K)$ differ only in their last bit.


We now return to the proof of Theorem 4.4: Huffman codes are optimal. We will prove this by 
induction on the size K of A. The result is clear for K = 2. Assume that the result is true for alphabets 
with size K — 1. 


Let $h : A \to \{0,1\}^*$ be a Huffman code and $c : A \to \{0,1\}^*$ an optimal code. The construction of
the Huffman code shows that there is a Huffman code $\tilde{h}$ on the new alphabet
\[ \tilde{A} = \{a_1, a_2, \ldots, a_{K-2}, a_{K-1} \cup a_K\} \]
of size K - 1. The letters here have probabilities $p_1, p_2, \ldots, p_{K-2}, p_{K-1} + p_K$ respectively and we will
write $\tilde{A}$ for a random variable that takes the letters in $\tilde{A}$ with these probabilities. The code h is then
given by

\[ h(a_j) = \begin{cases}
\tilde{h}(a_{K-1} \cup a_K)0 & \text{if } j = K-1; \\
\tilde{h}(a_{K-1} \cup a_K)1 & \text{if } j = K; \\
\tilde{h}(a_j) & \text{if } j \le K-2.
\end{cases} \]
This means that the expected code lengths for h and $\tilde{h}$ satisfy

\[ E|h(A)| = E|\tilde{h}(\tilde{A})| + (p_{K-1} + p_K) . \]

For the optimal code c, we apply Lemma 4.5. So choose c with $c(a_{K-1})$ and $c(a_K)$ differing
only in their last bit, say $c(a_{K-1}) = w0$ and $c(a_K) = w1$ for a word $w \in \{0,1\}^*$. Define a new code $\tilde{c}$ on
$\tilde{A}$ by setting $\tilde{c}(a_{K-1} \cup a_K) = w$ and $\tilde{c}(a_j) = c(a_j)$ for $j \le K-2$. This is readily seen to be prefix-free
and has

\[ E|c(A)| = E|\tilde{c}(\tilde{A})| + (p_{K-1} + p_K) . \]

The inductive hypothesis shows that $\tilde{h}$ is optimal. So
\[ E|\tilde{h}(\tilde{A})| \le E|\tilde{c}(\tilde{A})| . \]

Therefore,

\[ E|h(A)| \le E|c(A)| . \]
However, E|c(A)| is the minimal average code length, so we must have equality and hence h is itself
optimal.
5: COMPRESSION


Throughout this section A will be an alphabet of K letters. We wish to send messages m =
$a_1 a_2 a_3 \ldots$ with each $a_j \in A$. We will let $A_j$ be a random variable that gives the jth letter and we will
assume that each $A_j$ is identically distributed with $P(A_j = a) = p(a)$. We wish to encode the message
by strings in an alphabet B of size D = |B|.


5.1 Block Codes 


In the last section we considered a code $c : A \to B^*$. The expected code length for this code is
\[ E|c(A_1)| = \sum p(a) |c(a)| . \]


The noiseless coding theorem showed that there was an optimal code c for which


\[ \frac{H(A_1)}{\log_2 D} \le E|c(A_1)| < 1 + \frac{H(A_1)}{\log_2 D} . \]


However, rather than encoding the message one letter at a time we could divide it into blocks and 
encode each block. Divide the message m into blocks of length r: 


\[ m = (a_1 a_2 \ldots a_r)(a_{r+1} a_{r+2} \ldots a_{2r})(a_{2r+1} a_{2r+2} \ldots a_{3r}) \ldots \]


Each block is in the new alphabet $A^r$ and we can find an optimal code for this:


\[ c_r : A^r \to B^* \quad \text{with} \quad \frac{H(A_1, A_2, \ldots, A_r)}{\log_2 D} \le E|c_r(A_1, A_2, \ldots, A_r)| < 1 + \frac{H(A_1, A_2, \ldots, A_r)}{\log_2 D} . \]


We call this a block code. Of course, such a block code deals with r letters of A at a time and so the
average code length per letter is

\[ \frac{E|c_r(A_1, A_2, \ldots, A_r)|}{r} . \]

This satisfies

\[ \frac{H(A_1, A_2, \ldots, A_r)}{r \log_2 D} \le \frac{E|c_r(A_1, A_2, \ldots, A_r)|}{r} < \frac{1}{r} + \frac{H(A_1, A_2, \ldots, A_r)}{r \log_2 D} . \]

Thus the expected code length per letter is approximately

\[ \frac{H(A_1, A_2, \ldots, A_r)}{r \log_2 D} . \]


In Corollary 2.2 we saw that


\[ H(A_1, A_2, \ldots, A_r) \le H(A_1) + H(A_2) + \ldots + H(A_r) \]


with equality when the letters $A_1, A_2, \ldots$ are independent. The random variables $(A_j)$ all have the same
distribution so we see that
\[ H(A_1, A_2, \ldots, A_r) \le r H(A_1) . \]


Consider first the case where the letters $(A_j)$ are independent. Then $H(A_1, A_2, \ldots, A_r) = r H(A_1)$
and so we see that the optimal block code satisfies


\[ \frac{H(A_1)}{\log_2 D} \le \frac{E|c_r(A_1, A_2, \ldots, A_r)|}{r} < \frac{1}{r} + \frac{H(A_1)}{\log_2 D} . \]


So the expected code length per letter tends to $H(A_1)/\log_2 D$ as the block length r increases to infinity.
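
The effect of the block length can be seen numerically. The sketch below takes independent letters and D = 2 and, for simplicity, uses Shannon-Fano word lengths on blocks rather than an optimal code; these already satisfy the same bounds. The two-letter distribution (0.9, 0.1) and the function names are assumptions made for this illustration, and all probabilities must be non-zero.

```python
import itertools
import math


def entropy(probs) -> float:
    """Entropy in bits of a single letter."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def length_per_letter(probs, r: int) -> float:
    """Expected Shannon-Fano code length per letter for blocks of r independent letters."""
    total = 0.0
    for block in itertools.product(probs, repeat=r):
        p_block = math.prod(block)                      # probability of this block
        total += p_block * math.ceil(-math.log2(p_block))
    return total / r


probs = (0.9, 0.1)      # hypothetical letter probabilities
h = entropy(probs)
per_letter = [length_per_letter(probs, r) for r in range(1, 9)]
```

Each entry of `per_letter` lies between H(A_1) and H(A_1) + 1/r, so the values are squeezed towards the entropy, about 0.47 bits per letter here, as r grows.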


We can also deal with more complicated cases where the letters (A;) are not independent. In 
English the letters are far from independent; for example the letter following q is very likely to be u. 
As a simple model we will assume that the probability that $A_j$ takes a value $a_j$ depends only on the
previous letter $A_{j-1}$, so

\[ P(A_j = a_j \mid A_1 = a_1, A_2 = a_2, \ldots, A_{j-1} = a_{j-1}) = P(A_j = a_j \mid A_{j-1} = a_{j-1}) \]
and that this is independent of j. This implies that
\[ H(A_j \mid A_1, A_2, \ldots, A_{j-1}) = H(A_j \mid A_{j-1}) = H(A_2 \mid A_1) . \]
Now we know that
\[ H(A_1, A_2, \ldots, A_r) = H(A_r \mid A_1, A_2, \ldots, A_{r-1}) + H(A_1, A_2, \ldots, A_{r-1}) \]
so we see that
\[ H(A_1, A_2, \ldots, A_r) = H(A_r \mid A_{r-1}) + H(A_{r-1} \mid A_{r-2}) + \ldots + H(A_2 \mid A_1) + H(A_1) = (r-1) H(A_2 \mid A_1) + H(A_1) . \]


Hence we see that the expected code length per letter tends to $H(A_2 \mid A_1)/\log_2 D$ as r increases to
infinity.


(In the situation described here, the successive letters $A_j$ form a Markov chain with the probability
distribution p(a) as the invariant measure.)


5.2 Compression 


The maximum value for the entropy $H(A_1)$ arises when all the K letters are equally likely and is
$\log_2 K$. In this case, the optimal code satisfies


\[ \log_D K = \frac{H(A_1)}{\log_2 D} \le E|c(A_1)| < 1 + \frac{H(A_1)}{\log_2 D} = \log_D (DK) . \]


There are K possibilities for letters from A. Each letter in the code word c(a) has D possibilities so the
number of code words of length L is $D^L$ and this is K when $L = \log_D K$, which is approximately the
expected code length. Hence, in this case we do not get any compression of the message.


However, if the distribution of the letters is not uniform, then the entropy $H(A_1)$ is strictly less
than $\log_2 K$ and so the optimal code does compress the message.


If we use block codes and the letters in our messages are not independent then there is much greater
scope for compression. For then we know that


\[ \frac{H(A_1, A_2, \ldots, A_r)}{r} < H(A_1) \]


so the expected code length per letter satisfies


\[ \frac{E|c_r(A_1, A_2, \ldots, A_r)|}{r} < \frac{1}{r} + \frac{H(A_1, A_2, \ldots, A_r)}{r \log_2 D} < \frac{1}{r} + \frac{H(A_1)}{\log_2 D} . \]


The entropy $\frac{H(A_1, A_2, \ldots, A_r)}{r \log_2 D}$ measures the amount of information per letter in a block of length
r. This gives us an upper limit on the amount by which we can compress the message (on average)
without losing information.
6: NOISY CHANNELS


So far we have assumed that each coded message is transmitted without error through the channel 
to the receiver where it can be decoded. In many practical applications, there is a small but appreciable 
probability of error, so a letter may be altered in the channel. We want to devise codes that detect, or 
even correct, such errors. 


6.1 Memoryless Channels 


A coded message $b_1 b_2 b_3 \ldots b_N \in \mathcal{B}^*$ is sent through a channel $C$ where each letter $b_n$ may be altered
to another. We will assume that any changes to one letter are independent of the other letters and what
happens to them. In this case we say the channel is memoryless. So, if we write $B_n$ for the random
variable that gives the $n$th letter of the message and $B'_n$ for the random variable giving the $n$th letter
received, then

$$P(b'_1 b'_2 b'_3 \ldots b'_N \text{ received} \mid b_1 b_2 b_3 \ldots b_N \text{ sent}) = \prod_{n=1}^{N} P(B'_n = b'_n \mid B_n = b_n) .$$

This gives a transition matrix

$$P(B'_n = i \mid B_n = j) \quad \text{for } i, j \in \mathcal{B}$$

describing how the $n$th letter is altered. We expect the probability of error to be small, so the probability
$P(B'_n = i \mid B_n = i)$ should be close to 1.


We will also assume that the chance of a particular alteration does not depend on the index $n$, so
the channel is time independent. Then the transition matrices do not change with $n$.


Example:
Let $\mathcal{B} = \{0,1,2,3,4,5,6,7,8,9\}$ and the transition matrix be

$$P(j \text{ received} \mid i \text{ sent}) = \begin{cases} \frac34 & \text{when } j = i; \\ \frac14 & \text{when } j = i+1; \\ 0 & \text{otherwise.} \end{cases}$$

If I send a word $b_1 b_2 b_3 \ldots b_N$, the probability that it is received without error is only $\left(\frac34\right)^N$.


I can improve the chances of receiving the message without error by sending each letter 3 times in
succession. The probability that at least 2 of the 3 are received without error is then

$$\left(\tfrac34\right)^3 + 3\left(\tfrac14\right)\left(\tfrac34\right)^2 = \tfrac{27}{32} = 0.84375$$

which is larger than $\frac34$. So we have increased the probability of decoding the message without error to
$\left(\frac{27}{32}\right)^N$, albeit at the price of sending a message three times as long.


We can do much better than this. For suppose that we first recode our message using the shorter
alphabet $\mathcal{E} = \{0,2,4,6,8\}$. If a letter $i \in \mathcal{E}$ is sent, then either $i$ or $i+1$ is received. In either case we
can tell with certainty that $i$ was sent. So we can decode the message perfectly.
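

The arithmetic in this example can be checked by machine. The following Python sketch computes the probability that at least 2 of 3 copies arrive unchanged, and confirms that a receiver who knows that only even digits were sent can always recover them. The function names `majority_ok` and `decode_even` are illustrative choices, not notation from the course.

```python
from fractions import Fraction

P_OK = Fraction(3, 4)   # probability that a digit is received unchanged


def majority_ok(p_ok: Fraction) -> Fraction:
    """Probability that at least 2 of 3 independent copies arrive unchanged."""
    p_bad = 1 - p_ok
    return p_ok**3 + 3 * p_bad * p_ok**2


def decode_even(received: int) -> int:
    """Recover the even digit i that was sent, given that i or i + 1 arrived."""
    return received - (received % 2)


print(majority_ok(P_OK))           # 27/32
print(float(majority_ok(P_OK)))    # 0.84375

# Whichever of its two possible outputs arrives, each even digit is recovered.
print(all(decode_even(i + shift) == i
          for i in (0, 2, 4, 6, 8) for shift in (0, 1)))   # True
```

The price of perfect decoding is that only 5 of the 10 digits are available for messages.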


6.2 The Binary Symmetric Channel 


For most of this course we will use binary codes for simplicity. When $\mathcal{B} = \{0,1\}$, the binary
symmetric channel (BSC) is a time independent, memoryless channel with transition matrix

$$\begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix} .$$

Here $p$ is the probability of error in a single bit. Usually it is small. Note that if $p = \frac12$ then no
information is transmitted, everything is noise. When $p = 0$ there is no noise and no errors to be
corrected.


Exercise: 
In a binary symmetric channel we usually take the probability p of error to be less than 1/2. Why do 
we not consider 1 > p > 1/2? 
Example:
Punched computer tapes traditionally had 8 spaces per row and used holes to indicate the 1 bit. Of
these bits the first 7 $(x_1, x_2, \ldots, x_7)$ carried information while the last $x_8$ was a check digit satisfying:

$$x_1 + x_2 + \ldots + x_7 + x_8 \equiv 0 \pmod 2 . \qquad (*)$$

We can model the chances of errors in the punched tape using a binary symmetric channel with error
probability $p$. Then the probability of receiving the information $x_1 x_2 \ldots x_7$ without error is $(1-p)^7$. If
there is one error in the 8 bits $x_1 x_2 \ldots x_7 x_8$ then $(*)$ will not hold for the received message and this
will show that there was an error. In this case the information could be sent again.


Note that, when 2 errors occur, (*) remains true and we do not know that there has been an error. 


For instance, if $p = 0.1$, then the probability of 0 errors in $x_1 x_2 \ldots x_7$ is 0.48, the probability of 1
error that is detected by the check digit is 0.33.
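

These figures can be reproduced with a short calculation. The sketch below assumes that the second figure refers to exactly one error among the 7 information bits with the check bit received intact; that reading is an assumption, chosen because it matches the value 0.33. It also computes the probability of an odd number of errors among all 8 bits, which is the event that the parity check fails. All function names are illustrative.

```python
from math import comb


def p_no_error(p: float, n_info: int = 7) -> float:
    """Probability that all n_info information bits arrive unchanged."""
    return (1 - p) ** n_info


def p_one_info_error_detected(p: float, n_info: int = 7) -> float:
    """Exactly one information bit flipped and the check bit intact (assumed reading)."""
    return n_info * p * (1 - p) ** (n_info - 1) * (1 - p)


def p_odd_errors(p: float, n: int = 8) -> float:
    """Probability of an odd number of flips among n bits: the parity check fails."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(1, n + 1, 2))


print(round(p_no_error(0.1), 2))                   # 0.48
print(round(p_one_info_error_detected(0.1), 2))    # 0.33
print(p_odd_errors(0.1))
```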


Exercise: 
Show that the probability of detecting errors at all using $(*)$ is $\frac12\left(1 - (1-2p)^8\right)$.


6.3 Check Digits 


Check digits are used very widely to detect errors. 


Example: 
University Candidate Numbers are of the form 


1234A, 1235B, 1236C, 1237D, ... 


The first 4 digits identify the candidate, while the final letter is a check digit. If a candidate writes one 
letter incorrectly, then he or she will produce an invalid candidate number. The desk number can then 
be used to correct the error. 


Exercise: 
Books are identified by the ISBN-10 code. For example: 


ISBN 0-521-40456-8


This consists of 9 decimal digits $x_1$-$x_2 x_3 x_4$-$x_5 x_6 x_7 x_8 x_9$ which identify the publisher, author and title.
The final digit $x_{10} \in \{0,1,2,\ldots,9,X\}$ is a check digit which satisfies

$$10x_1 + 9x_2 + 8x_3 + \ldots + 2x_9 + x_{10} \equiv 0 \pmod{11} .$$


Check the ISBN number above is valid. Show that altering any single digit or transposing any two 
adjacent digits in an ISBN-10 code will always result in an invalid code. Hence such errors can be 
detected. 


How do check digits work in ISBN-13? Do they detect all single digit errors? 
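

The ISBN-10 condition is simple to test mechanically. The following sketch is one possible checker; the function name `isbn10_is_valid` and the decision to ignore hyphens and spaces are assumptions made for illustration. It treats a final `X` as the value 10.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check 10*x1 + 9*x2 + ... + 2*x9 + x10 = 0 (mod 11), ignoring hyphens and spaces."""
    chars = [ch for ch in isbn if ch not in "- "]
    if len(chars) != 10:
        return False
    values = []
    for position, ch in enumerate(chars):
        if ch in "0123456789":
            values.append(int(ch))
        elif ch in "Xx" and position == 9:   # X is allowed only as the check digit
            values.append(10)
        else:
            return False
    weighted_sum = sum((10 - k) * x for k, x in enumerate(values))
    return weighted_sum % 11 == 0


print(isbn10_is_valid("0-521-40456-8"))   # True
print(isbn10_is_valid("0-521-40465-8"))   # False: two adjacent digits transposed
```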
6.4 The Hamming Code


In the earlier lectures we saw that, when there was redundant information in a message, we could 
compress it. Alternatively, we can expand a code to add redundant information. This gives us the 
possibility of detecting errors or even correcting them. 


Hamming produced a binary code that can correct a single error. This code takes a binary word 
of length 4 and codes it as a binary word of length 7. When we receive a word we look for the closest 
possible code word. Provided that there has been no more than one error in the 7 bits, we can always 
find that code word correctly and hence decode it. 


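The integers modulo 2 form a field $\mathbb{F}_2$ and we can use these to represent binary bits. So a binary
string of length $N$ is represented by a vector in the $N$-dimensional vector space $\mathbb{F}_2^N$ over $\mathbb{F}_2$. Hamming’s
code is given by a linear map $C : \mathbb{F}_2^4 \to \mathbb{F}_2^7$. This map is given by the matrix:

$$C = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 0&1&1&1 \\ 1&0&1&1 \\ 1&1&0&1 \end{pmatrix} .$$

So a word $x \in \mathbb{F}_2^4$ is coded as

$$Cx = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 0&1&1&1 \\ 1&0&1&1 \\ 1&1&0&1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_2 + x_3 + x_4 \\ x_1 + x_3 + x_4 \\ x_1 + x_2 + x_4 \end{pmatrix} .$$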


The first 4 bits carry the information and the remaining 3 are check digits.
We can check that each column $y$ of the matrix $C$ satisfies

$$\begin{aligned} y_1 + y_3 + y_5 + y_7 &= 0 \\ y_2 + y_3 + y_6 + y_7 &= 0 \\ y_4 + y_5 + y_6 + y_7 &= 0 . \end{aligned} \qquad (\dagger)$$

Hence each code word $y = Cx$ must also satisfy these conditions $(\dagger)$. The converse is also true. Suppose
that $y$ satisfies $(\dagger)$. Then the equations determine $y_5, y_6, y_7$ in terms of $y_1, y_2, y_3, y_4$. So we see that

$$y = Cx \quad \text{where} \quad x = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} \in \mathbb{F}_2^4 .$$

Hence the equations $(\dagger)$ determine precisely when $y$ is a code word or not.
We can rewrite the linear equations $(\dagger)$ in matrix terms as

$$Sy = \begin{pmatrix} 1&0&1&0&1&0&1 \\ 0&1&1&0&0&1&1 \\ 0&0&0&1&1&1&1 \end{pmatrix} y = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} .$$

This matrix $S$ is called the syndrome matrix. We have seen that $SCx = 0$ for every string $x \in \mathbb{F}_2^4$.
Indeed, $S \circ C = 0$ and the image $C(\mathbb{F}_2^4)$ is equal to the kernel $\ker S$.
Suppose that the string $x \in \mathbb{F}_2^4$ is encoded as $Cx$ but an error occurs in the $j$th bit and nowhere
else. Then we receive

$$y = Cx + e_j$$

where $e_j$ is the vector with 1 in the $j$th place and 0s elsewhere. We know that $Cx$ satisfies $(\dagger)$ but $e_j$
does not. Hence

$$Sy = SCx + Se_j = Se_j ,$$

which is the $j$th column of the syndrome matrix $S$. Note that this $j$th column is simply the bits of $j$
expanded in binary and reversed. For example

$$Se_6 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} \quad \text{corresponds to } 110_2 = 6 .$$


In this way we can tell where the error occurred. Since the error is a change from 0 to 1 or 1 to 0, we 
can then correct it. 


If I receive a message $y$ in which two bits have errors, say $y = Cx + e_i + e_j$, then the syndrome is

$$Sy = Se_i + Se_j .$$

Since $i$ and $j$ are different, the syndromes $Se_i$ and $Se_j$ are also different. So their sum $Se_i + Se_j$ is not
zero. Hence $Sy$ is non-zero and tells us that there have been errors. However, it does not tell us what
those errors are. For example, if we receive the string 1000000, then the syndrome is 100. This may
have arisen from the code word 0000000 with one error or the code word 1000011 with two errors.


To summarise, the Hamming code detects 2 errors and can correct 1 error. 
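

The encoding and syndrome decoding described above can be sketched in a few lines of Python. The matrices are copied from the text; the helper names `mat_vec`, `encode`, `syndrome` and `decode` are illustrative, and the decoder assumes that at most one bit has been altered.

```python
from itertools import product

# Rows of C and S as given in the text; all arithmetic is modulo 2.
C = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
]
S = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def mat_vec(matrix, vector):
    """Multiply a 0/1 matrix by a 0/1 vector over the field with two elements."""
    return [sum(m * v for m, v in zip(row, vector)) % 2 for row in matrix]


def encode(x):
    return mat_vec(C, x)


def syndrome(y):
    return mat_vec(S, y)


def decode(y):
    """Correct at most one flipped bit, then return the 4 information bits."""
    s = syndrome(y)
    j = s[0] + 2 * s[1] + 4 * s[2]   # the syndrome holds the binary digits of j, reversed
    corrected = list(y)
    if j:                            # j == 0 means y already satisfies the checks
        corrected[j - 1] ^= 1
    return corrected[:4]


received = encode([1, 0, 0, 0])      # the code word 1000011
received[5] ^= 1                     # introduce an error in bit 6
print(syndrome(received))            # [0, 1, 1]
print(decode(received))              # [1, 0, 0, 0]

# Every single-bit error in every code word is corrected.
all_corrected = all(
    decode([bit ^ (i == j) for i, bit in enumerate(encode(list(x)))]) == list(x)
    for x in product((0, 1), repeat=4)
    for j in range(7)
)
print(all_corrected)                 # True
```

With two errors the same decoder still reports a non-zero syndrome but flips the wrong bit, which is the ambiguity described in the text.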


We can measure the distance between two vectors $y$ and $z$ in $\mathbb{F}_2^N$ by counting the number of places
where they differ. This gives

$$d(y,z) = |\{ j : y_j \ne z_j \}| ,$$

which is called the Hamming distance. So the Hamming distance is the number of bit errors required
to change $y$ to $z$. It is simple to see that this is a metric on $\mathbb{F}_2^N$ and we will always use this metric.


Let $x$ and $x'$ be two different strings in $\mathbb{F}_2^4$, and $Cx, Cx'$ the corresponding code words. Their
difference

$$Cx' - Cx = C(x' - x)$$

is the sum of some of the columns of the matrix $C$. It is easy to check that each such sum has at least
3 non-zero bits. So $d(Cx', Cx)$ must be at least 3. It can be 3, for example $d(1000011, 0001111) = 3$.


Suppose that we receive a string $y \in \mathbb{F}_2^7$ and wish to decode it. If $y$ is a code word $Cx$ then we
decode it as $x$. If there is one error in $y$, we can use the syndrome to find and correct that error. So
we find a string $x \in \mathbb{F}_2^4$ with $d(y, Cx) = 1$. If there are two errors, it is natural to look for the string
$x \in \mathbb{F}_2^4$ for which $Cx$ is closest to $y$, that is

$$d(y, Cx)$$

is minimal. We would then guess that $y$ arose from $x$ with $d(y, Cx)$ errors. The guess may not be
unique. For example, if we receive 0000001, this is at distance 2 from both $1000011 = C(1000)$ and
$0100101 = C(0100)$.


Lecture 6 


7: ERROR CORRECTING CODES


In this lecture we will consider only codes $c : \mathcal{A} \to \mathcal{B}^N$ of constant length $N$. We will use the
discrete metric on $\mathcal{B}$:

$$d(b,v) = \begin{cases} 0 & \text{when } b = v; \\ 1 & \text{when } b \ne v. \end{cases}$$

The Hamming distance on $\mathcal{B}^N$ is then

$$d(b,v) = \sum_{j=1}^{N} d(b_j, v_j) .$$

It is clear that this is a metric on $\mathcal{B}^N$. Note that the Hamming distance $d(b,v)$ counts the number of
co-ordinates where $b$ and $v$ differ. When a word $b \in \mathcal{B}^N$ is transmitted through a noisy channel, we
receive an altered word $v$. The Hamming distance $d(b,v)$ counts the number of letters that have been
altered.


7.1 Decoding with Errors 


Suppose that a word $b \in \mathcal{B}^N$ is sent through a noisy channel and received as $v \in \mathcal{B}^N$. We then
wish to decode this. Because of the errors introduced by the channel, the word we receive may not be
a code word at all and certainly not the code word $b$ that was sent. How should we guess $b$ when we
know $v$?


We will consider, briefly, three possible decoding rules. Suppose that $v \in \mathcal{B}^N$ has been received.


Ideal Observer 
Decode v as the word b with P(b sent | v received ) maximal. This is a sensible choice if we know 
enough to find the maximum. In most cases we do not. 


Maximum Likelihood 
Decode v as the word b with P(v received | b sent ) maximal. This is usually easier to find. 


Minimum Distance 
Decode $v$ as the word $b$ with $d(b,v)$ minimal.


Proposition 7.1 
If all the code words are equally likely, then the ideal observer and maximum likelihood rules give the 
same result. 


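Proof:
We know that

$$P(v \text{ received} \mid b \text{ sent}) = \frac{P(v \text{ received and } b \text{ sent})}{P(b \text{ sent})} = P(b \text{ sent} \mid v \text{ received}) \, \frac{P(v \text{ received})}{P(b \text{ sent})} .$$

For a fixed received word $v$, the factor $P(v \text{ received})$ does not depend on $b$. So, if all the code words
are equally likely, then $P(b \text{ sent})$ is also the same for every $b$, and the ideal observer rule and the
maximum likelihood rule give the same result.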


Proposition 7.2 
If letters are transmitted through a binary symmetric channel with error probability $p < \frac12$, then the
maximum likelihood and minimum distance rules give the same result.
Proof:

$$P(v \text{ received} \mid b \text{ sent}) = \prod_{j=1}^{N} P(v_j \text{ received} \mid b_j \text{ sent}) = p^d (1-p)^{N-d} = (1-p)^N \left(\frac{p}{1-p}\right)^d$$

where $d$ is the number of terms where $v_j \ne b_j$. Hence $d$ is the Hamming distance $d(v,b)$. Since
$0 \le p/(1-p) < 1$ we see that $P(v \text{ received} \mid b \text{ sent})$ is maximal when $d(b,v)$ is minimal.


We will generally use the minimum distance decoding rule. 
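

As a small illustration of Proposition 7.2, the following sketch decodes every possible received word by both rules and checks that they agree. The code book used here, a 5-fold repetition code, and the value $p = 0.1$ are hypothetical choices made only for the demonstration.

```python
from itertools import product


def hamming_distance(b, v):
    """Number of positions in which the words b and v differ."""
    return sum(x != y for x, y in zip(b, v))


def likelihood(v, b, p):
    """P(v received | b sent) on a binary symmetric channel with error probability p."""
    d = hamming_distance(b, v)
    return p**d * (1 - p) ** (len(b) - d)


def decode_min_distance(v, codebook):
    return min(codebook, key=lambda b: hamming_distance(b, v))


def decode_max_likelihood(v, codebook, p):
    return max(codebook, key=lambda b: likelihood(v, b, p))


codebook = ["00000", "11111"]   # hypothetical code book: 5-fold repetition code
p = 0.1                         # hypothetical error probability, below 1/2

agree = all(
    decode_min_distance(v, codebook) == decode_max_likelihood(v, codebook, p)
    for v in ("".join(bits) for bits in product("01", repeat=5))
)
print(agree)   # True
```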


7.2 Error Detecting and Error Correcting 


Throughout this section we will use binary codes $c : \mathcal{A} \to \{0,1\}^N$. This is for simplicity since the
results can easily be adapted to more general alphabets. Recall that the code words are the words $c(a)$
for $a \in \mathcal{A}$. We will write $K$ for the number of these $K = |\mathcal{A}|$. If a word $b = c(a)$ is sent, then we
receive a word $v \in \{0,1\}^N$. We decode this by finding the code word $c(a')$ closest to $v$, so $d(c(a'), v)$ is
minimal.

The ball of radius $r$ centred on $b \in \{0,1\}^N$ is

$$B(b,r) = \{ v \in \{0,1\}^N : d(b,v) < r \} .$$

Its volume is the number of points that it contains:

$$|B(b,r)| = \sum_{0 \le k < r} \binom{N}{k} .$$

This is clearly independent of the centre $b$ and we will denote it by $V(N,r)$. It is reasonably simple to
estimate the size of $V(N,r)$ and we will do so later.


The code $c : \mathcal{A} \to \{0,1\}^N$ is $e$-error detecting if, whenever no more than $e$ letters in a code word
$c(a)$ are altered, the received word is not a different code word $c(a')$. This means that we can tell if
there have been errors, provided that there are at most $e$ of them.


The code $c : \mathcal{A} \to \{0,1\}^N$ is $e$-error correcting if, whenever no more than $e$ letters in a code word
$c(a)$ are altered, the received word is still decoded as $a$ using the minimum distance decoding rule.


The minimum distance for a code $c : \mathcal{A} \to \{0,1\}^N$ is

$$\delta = \min \{ d(c(a), c(a')) : a, a' \in \mathcal{A} \text{ with } a \ne a' \} .$$


For this we immediately have: 


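Proposition 7.3 Minimum distance for a code
Let the code $c : \mathcal{A} \to \{0,1\}^N$ have minimum distance $\delta > 0$.

(a) $c$ is $(\delta - 1)$-error detecting and not $\delta$-error detecting.

(b) $c$ is $\lfloor \frac12(\delta - 1) \rfloor$-error correcting but not $\lfloor \frac12(\delta + 1) \rfloor$-error correcting.

Note that $\lfloor \frac12(\delta + 1) \rfloor = \lfloor \frac12(\delta - 1) \rfloor + 1$.

Proof:

(a) Clearly no code word other than $c(a)$ can lie within a distance $\delta - 1$ of $c(a)$. However, there are
two letters $a_1, a_2 \in \mathcal{A}$ with $d(c(a_1), c(a_2)) = \delta$, so $c$ is not $\delta$-error detecting.

(b) If $a' \ne a$, and $d(c(a), v) \le \lfloor \frac12(\delta - 1) \rfloor$, then the triangle inequality gives

$$d(c(a'), v) \ge d(c(a'), c(a)) - d(c(a), v) \ge \delta - \lfloor \tfrac12(\delta - 1) \rfloor > \lfloor \tfrac12(\delta - 1) \rfloor .$$

So $c$ is $\lfloor \frac12(\delta - 1) \rfloor$-error correcting. However, there are two letters $a_1, a_2 \in \mathcal{A}$ with $d(c(a_1), c(a_2)) = \delta$,
so $c(a_1)$ and $c(a_2)$ differ in exactly $\delta$ places. Choose $\lfloor \frac12(\delta + 1) \rfloor$ of these places and change $c(a_1)$ in
these to get a word $v$ with

$$d(v, c(a_1)) = \lfloor \tfrac12(\delta + 1) \rfloor \quad \text{and} \quad d(v, c(a_2)) = \delta - \lfloor \tfrac12(\delta + 1) \rfloor \le \lfloor \tfrac12(\delta + 1) \rfloor .$$

So $c$ can not be $\lfloor \frac12(\delta + 1) \rfloor$-error correcting.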


Example:
For the repetition code, where each letter is repeated $m$ times, the minimum distance is $m$, so this code
is $(m - 1)$-error detecting and $\lfloor \frac12(m - 1) \rfloor$-error correcting.


For a code $\mathcal{A} \to \{0,1\}^N$ where the $N$th bit is a check digit, the minimum distance is 2, so this code
is 1-error detecting and 0-error correcting.


For the Hamming code $h : \{0,1\}^4 \to \{0,1\}^7$, the minimum distance is 3, so the Hamming code is
2-error detecting and 1-error correcting.
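

These three minimum distances can be confirmed by brute force. The sketch below builds small instances of each code (a 5-fold repetition code and a parity code on 3 information bits are hypothetical sizes chosen for the demonstration) and reports $\delta$, together with the detection and correction capabilities that Proposition 7.3 gives.

```python
from itertools import combinations, product


def hamming_distance(b, v):
    return sum(x != y for x, y in zip(b, v))


def minimum_distance(codewords):
    """Smallest Hamming distance between two distinct code words."""
    return min(hamming_distance(b, v) for b, v in combinations(codewords, 2))


def hamming_codeword(x1, x2, x3, x4):
    """The Hamming code word Cx, with the three check digits from the text."""
    return (x1, x2, x3, x4, (x2 + x3 + x4) % 2, (x1 + x3 + x4) % 2, (x1 + x2 + x4) % 2)


repetition = [(0,) * 5, (1,) * 5]
parity = [bits + (sum(bits) % 2,) for bits in product((0, 1), repeat=3)]
hamming = [hamming_codeword(*bits) for bits in product((0, 1), repeat=4)]

for name, code in [("repetition", repetition), ("parity", parity), ("Hamming", hamming)]:
    delta = minimum_distance(code)
    # delta, number of errors detected, number of errors corrected
    print(name, delta, delta - 1, (delta - 1) // 2)
```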


7.3 Covering Estimates 


Observe that the code $c : \mathcal{A} \to \{0,1\}^N$ is $e$-error correcting precisely when the balls $B(c(a), e+1)$
for $a \in \mathcal{A}$ are disjoint. We can use this to estimate the number of code words for such a code.


Proposition 7.4 Hamming’s bound
If $c : \mathcal{A} \to \{0,1\}^N$ is an $e$-error correcting code, then the number of code words $K = |\mathcal{A}|$ must satisfy

$$K \le \frac{2^N}{V(N, e+1)} .$$


Proof:
First note that

$$B(c, e+1) = \{ v \in \{0,1\}^N : d(c,v) < e+1 \} = \{ v \in \{0,1\}^N : d(c,v) \le e \}$$

so the open ball $B(c, e+1)$ is the closed ball of radius $e$.


When $c$ is $e$-error correcting, the balls $B(c(a), e+1)$ for $a \in \mathcal{A}$ must be disjoint. Hence their total
volume $K V(N, e+1)$ is at most equal to the volume of all of $\{0,1\}^N$, which is $2^N$.
A code is said to be perfect if it is $e$-error correcting for some $e$ and the balls $B(c(a), e+1)$ cover
all of $\{0,1\}^N$. This is the case when there is equality in Hamming’s bound. For example, Hamming’s
code is 1-error correcting, $K = 2^4$, $N = 7$ and

$$V(7,2) = \binom{7}{0} + \binom{7}{1} = 1 + 7 = 8 .$$

So $K = 2^4 = 2^7/8 = 2^7/V(7,2)$ and the code is perfect.
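

The volume $V(N,r)$ and the two bounds are straightforward to evaluate for small parameters. The sketch below does this for integer radii; the function names are illustrative, and `is_perfect` simply tests for equality in Hamming’s bound.

```python
from math import comb


def volume(n: int, r: int) -> int:
    """V(N, r): number of words at Hamming distance strictly less than r (integer r)."""
    return sum(comb(n, k) for k in range(r))


def hamming_upper_bound(n: int, e: int) -> int:
    """Largest K allowed by Hamming's bound for an e-error correcting code."""
    return 2**n // volume(n, e + 1)


def gsv_lower_bound(n: int, e: int) -> int:
    """Smallest integer K with K >= 2^N / V(N, e + 1), as in the lower bound."""
    return -(-(2**n) // volume(n, e + 1))


def is_perfect(n: int, k_words: int, e: int) -> bool:
    """Equality in Hamming's bound: the balls of radius e + 1 fill {0,1}^N exactly."""
    return k_words * volume(n, e + 1) == 2**n


print(volume(7, 2))               # 8
print(hamming_upper_bound(7, 1))  # 16
print(is_perfect(7, 16, 1))       # True
```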


We can also use a similar idea to give a lower bound. 


Proposition 7.5 Gilbert-Shannon-Varshamov bound
There is an $e$-error detecting code $c : \mathcal{A} \to \{0,1\}^N$ with the number of code words $K = |\mathcal{A}|$ satisfying

$$K \ge \frac{2^N}{V(N, e+1)} .$$


Proof:
For an $e$-error detecting code, no code word other than $c(a)$ can lie within the ball $B(c(a), e+1)$.
Choose a maximal set $C$ of code words satisfying this condition and let $K = |C|$. If there were any word
$v$ not in the union

$$\bigcup_{c \in C} B(c, e+1)$$

then we could add it to $C$, contradicting maximality. Therefore,

$$\bigcup_{c \in C} B(c, e+1) = \{0,1\}^N .$$

The volume of each ball is $V(N, e+1)$, so the volume of their union is no more than $K V(N, e+1)$.
Hence $K V(N, e+1) \ge 2^N$.


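Corollary 7.6
Let $c : \mathcal{A} \to \{0,1\}^N$ be a code with minimum distance $\delta$. Then

$$K = |\mathcal{A}| \le \frac{2^N}{V(N, \lfloor \frac12(\delta + 1) \rfloor)} .$$

Moreover, there is a code with minimum distance $\delta$ and

$$K = |\mathcal{A}| \ge \frac{2^N}{V(N, \delta)} .$$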


7.4 Asymptotics for V(N,r) 


We wish to use the Hamming and Gilbert-Shannon-Varshamov bounds of the last section to
give asymptotic estimates on how large error correcting codes are. To do this we need to determine the
asymptotic behaviour of the volumes $V(N,r)$. It will be useful to describe this in terms of the entropy
of a Bernoulli random variable that takes two values with probability $q$ and $1 - q$. We will denote this
by

$$h(q) = -q \log_2 q - (1 - q) \log_2 (1 - q) .$$


Lemma 7.7 Asymptotics of binomial coefficients
For natural numbers $0 \le r \le N$ we have

$$\frac{1}{N+1} 2^{N h(q)} \le \binom{N}{r} \le 2^{N h(q)}$$

where $q = r/N$.
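Proof:
Let $X$ be a binomial $B(N,q)$ random variable, so

$$P(X = k) = \binom{N}{k} q^k (1 - q)^{N-k} .$$

(The $N$ independent Bernoulli trials underlying $X$ have total entropy $N h(q)$.) It is simple to check that
this probability is maximal when $k = r = qN$. Therefore, since the $N + 1$ probabilities $P(X = k)$ for
$0 \le k \le N$ sum to 1, the largest of them is at least $\frac{1}{N+1}$. This means that

$$\frac{1}{N+1} \le P(X = r) = \binom{N}{r} q^r (1 - q)^{N-r} = \binom{N}{r} 2^{-N h(q)} \le 1 .$$

The result follows when we observe that

$$2^{-N h(q)} = q^{Nq} (1 - q)^{N(1-q)} = q^r (1 - q)^{N-r} .$$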


Exercise:
Use Stirling’s formula $n! \sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$ to describe the behaviour of $\binom{N}{r}$ in terms of the entropy
$h(r/N)$.


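Proposition 7.8 Asymptotics of V(N,r)
For a natural number $N$ and $0 < r \le \frac12 N$

$$V(N,r) \le 2^{N h(q)}$$

for $q = r/N$ and

$$V(N,r) \ge \frac{1}{N+1} 2^{N h(\tilde q)}$$

for $\tilde q = (\lceil r \rceil - 1)/N$.


Proof:
$V(N,r)$ is the number of points in an open ball of radius $r$ in $\{0,1\}^N$, so

$$V(N,r) = \sum_{0 \le k < r} \binom{N}{k} .$$

Since $q = r/N \le \frac12$, we have $q^k (1 - q)^{N-k} \ge q^r (1 - q)^{N-r}$ for $0 \le k < r$. So

$$1 = \sum_{0 \le k \le N} \binom{N}{k} q^k (1 - q)^{N-k} \ge \sum_{0 \le k < r} \binom{N}{k} q^k (1 - q)^{N-k} \ge V(N,r) \, q^r (1 - q)^{N-r} = V(N,r) \, 2^{-N h(q)} .$$

This shows that $V(N,r) \le 2^{N h(q)}$.


For the other inequality, let $k = \lceil r \rceil - 1$ so $k$ is the largest integer strictly smaller than $r$. Then

$$V(N,r) \ge \binom{N}{k} \ge \frac{1}{N+1} 2^{N h(\tilde q)}$$

because of Lemma 7.7.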
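

The inequalities of Lemma 7.7 and Proposition 7.8 can be spot-checked numerically. The sketch below tests them for all integer $r$ in the permitted ranges and for $N$ up to 40; the upper limit 40 and the function names are arbitrary choices, and a numerical check of this kind is of course no substitute for the proofs.

```python
from math import ceil, comb, log2


def h(q: float) -> float:
    """Binary entropy h(q), with h(0) = h(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)


def volume(n: int, r: float) -> int:
    """V(N, r): sum of binomial coefficients C(N, k) over integers 0 <= k < r."""
    return sum(comb(n, k) for k in range(ceil(r)))


def lemma_7_7_holds(n: int) -> bool:
    return all(
        2 ** (n * h(r / n)) / (n + 1) <= comb(n, r) <= 2 ** (n * h(r / n))
        for r in range(n + 1)
    )


def proposition_7_8_holds(n: int) -> bool:
    return all(
        2 ** (n * h((r - 1) / n)) / (n + 1) <= volume(n, r) <= 2 ** (n * h(r / n))
        for r in range(1, n // 2 + 1)
    )


print(all(lemma_7_7_holds(n) for n in range(1, 41)))         # True
print(all(proposition_7_8_holds(n) for n in range(1, 41)))   # True
```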
Consider a code $c : \mathcal{A} \to \{0,1\}^N$. If $A$ is a random variable taking values in the alphabet $\mathcal{A}$ then
information is transmitted through the channel at a rate

$$\frac{H(A)}{N} .$$

This is largest when $A$ is equally distributed over the $K$ possible values. Then it is

$$\frac{\log_2 K}{N} .$$

Choose $q$ with $0 < q < \frac12$. Hamming’s bound shows that

$$\frac{\log_2 K}{N} \le 1 - \frac{\log_2 V(N, qN)}{N}$$

for $(qN - 1)$-error correcting codes. The last proposition now shows that the right side tends to $1 - h(q)$
as $N \to \infty$.


Similarly, the Gilbert-Shannon-Varshamov bound (Proposition 7.5) shows that there are $(qN - 1)$-error
detecting codes with information rate

$$\frac{\log_2 K}{N} \ge 1 - \frac{\log_2 V(N, qN)}{N}$$

and the right side tends to $1 - h(q)$ as $N \to \infty$.
8: INFORMATION CAPACITY


8.1 Mutual Information 


We wish to determine how much information can be passed successfully through a noisy channel 
for each bit that is transmitted. Consider a time-independent, memoryless channel. If a random letter 
B € B is sent through the channel, then a new random letter B’ is received. Because the channel is 
noisy, B’ may differ from B. 


The average amount of information given by B is the entropy H(B), while the average amount of 
information received is H(B’). The information received is partly due to B and partly due to the noise. 
The conditional entropy H(B’|B) is the amount of information in B’ conditional on the value of B, so 
it is the information due to the noise in the channel. The remaining information 


H(B’) — H(B’|B) 


is the part that is due to the letter B that was sent. It is the amount of uncertainty about B’ that is 
removed by knowing B. We therefore define the mutual information of B and B' to be 


I(B’, B) = H(B') — H(B'|B) . 


Proposition 8.1 Mutual Information
The mutual information satisfies:

$$I(B', B) = H(B') + H(B) - H(B', B) = I(B, B') .$$

Furthermore,
(a) $I(B', B) \ge 0$ with equality if and only if $B'$ and $B$ are independent.
(b) $I(B', B) \le H(B')$ with equality if and only if $B'$ is a function of $B$.
(c) $I(B', B) \le H(B)$ with equality if and only if $B$ is a function of $B'$.


Proof:
Recall from Lecture 2 that the conditional entropy satisfies $H(B'|B) = H(B', B) - H(B)$, so

$$I(B', B) = H(B') - H(B'|B) = H(B') - H(B', B) + H(B)$$

and the right side is clearly symmetric.
(a) Corollary 2.2 shows that $I(B', B) \ge 0$ with equality if and only if $B$ and $B'$ are independent.
(b) We always have $H(B'|B) \ge 0$ with equality if and only if $B'$ is a function of $B$.
(c) Since $I(B', B) = I(B, B')$ part (c) follows from (b).
In our case, the distribution $P(B = b)$ is known as are the transition probabilities $P(B' = b' \mid B = b)$.
From these we can calculate both

$$P(B' = b', B = b) = P(B' = b' \mid B = b) P(B = b)$$

and

$$P(B' = b') = \sum_{b \in \mathcal{B}} P(B' = b' \mid B = b) P(B = b) .$$


Thence we can find the entropies H(B), H(B’), H(B’, B) and the mutual information. If the channel 
has no noise, then B’ = B and the proposition shows that the mutual information is just the entropy 
H(B). 


The information capacity of the channel is the supremum of the mutual information $I(B', B)$
over all probability distributions for $B$. The probability distribution of $B$ is given by a point $p =
(p(a_1), p(a_2), \ldots, p(a_K))$ in the compact set $\{ p \in [0,1]^K : \sum p_k = 1 \}$. The mutual information is
a continuous function of $p$, so we know that the supremum is attained for some distribution. This
information capacity depends only on the transition probabilities $P(B'|B)$.


Proposition 8.2 Information capacity of a BSC 
A binary symmetric channel with probability p of error has information capacity 1 — h(p). 


Proof:
Suppose that the letter $B$ has the distribution $P(B = 1) = t$, $P(B = 0) = 1 - t$. Then the
entropy of $B$ is

$$H(B) = H(1 - t, t) = -t \log_2 t - (1 - t) \log_2 (1 - t) = h(t) .$$

The conditional entropy is

$$\begin{aligned} H(B'|B) &= P(B = 1) H(B'|B = 1) + P(B = 0) H(B'|B = 0) \\ &= t H(1 - p, p) + (1 - t) H(p, 1 - p) = h(p) . \end{aligned}$$

The distribution of $B'$ is

$$P(B' = 1) = t(1 - p) + (1 - t)p = t + p - 2tp, \qquad P(B' = 0) = tp + (1 - t)(1 - p) = 1 - t - p + 2tp$$

so its entropy is

$$H(B') = h(t + p - 2tp) .$$

Therefore the mutual information is

$$I(B', B) = H(B') - H(B'|B) = h(t + p - 2tp) - h(p) .$$

We can choose any distribution for $B$ and so take any $t$ between 0 and 1. The maximum value for
$h(t + p - 2tp)$ is 1 which is attained when $t = \frac12$ and $t + p - 2tp = \frac12$. So the information capacity of the
binary symmetric channel is $1 - h(p)$.
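

The calculation in this proof can be mirrored numerically. The sketch below computes the mutual information directly from an input distribution and a transition matrix, then searches a grid of values of $t$ for the maximum. The grid search, the grid size and the function names are choices made for this illustration; the closed form $1 - h(p)$ is the result established above.

```python
from math import log2


def entropy(probs):
    """Entropy in bits of a finite probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def mutual_information(input_dist, transition):
    """I(B', B) = H(B') - H(B'|B), with transition[b][b2] = P(B' = b2 | B = b)."""
    n_in, n_out = len(input_dist), len(transition[0])
    output_dist = [sum(input_dist[b] * transition[b][b2] for b in range(n_in))
                   for b2 in range(n_out)]
    conditional = sum(input_dist[b] * entropy(transition[b]) for b in range(n_in))
    return entropy(output_dist) - conditional


def bsc(p):
    """Transition matrix of a binary symmetric channel with error probability p."""
    return [[1 - p, p], [p, 1 - p]]


def h(p):
    return entropy([p, 1 - p])


p = 0.1
grid = [k / 1000 for k in range(1001)]   # candidate values of t = P(B = 1)
best_t = max(grid, key=lambda t: mutual_information([1 - t, t], bsc(p)))
capacity = mutual_information([1 - best_t, best_t], bsc(p))
print(best_t)               # 0.5
print(round(capacity, 2))   # 0.53
```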
[Figure: Capacity of a BSC(p), plotted against $p$ for $0 \le p \le 1$.]


Note that when $p = 0$ or 1, the channel transmits perfectly and the information capacity is 1.
When $p = \frac12$, no information is transmitted and the capacity is 0.


We often wish to consider using the same channel repeatedly, for example, when we wish to transmit
a word in $\{0,1\}^N$ by sending the $N$ bits one at a time through a binary symmetric channel. We can
also think of this as having $N$ copies of the channel in parallel and sending the $j$th bit through the $j$th
channel. It is simple to compute the capacity in such a situation.


Proposition 8.3 The capacity of parallel channels 

Let $Q$ be a channel that transmits letters from the alphabet $\mathcal{B}$ and has capacity $\mathrm{Cap}(Q)$. Form a new
channel $Q^N$ that transmits an $N$-tuple $(b_1, b_2, \ldots, b_N) \in \mathcal{B}^N$ one letter at a time through the channel
$Q$ with each use being independent of the others. Then the capacity of $Q^N$ is $N \mathrm{Cap}(Q)$.


Proof:
Let $B = (B_1, B_2, \ldots, B_N)$ be a random variable taking values in the product alphabet $\mathcal{B}^N$ and
let $B' = (B'_1, B'_2, \ldots, B'_N)$ be the random variable we receive after sending $B$ through the channel $Q^N$.
For each vector $b = (b_1, b_2, \ldots, b_N)$ we know that

$$H(B'|B = b) = \sum_{j=1}^{N} H(B'_j | B_j = b_j)$$

because, once we condition on $B$, each $B'_j$ is independent of all the other letters received. Therefore,

$$H(B'|B) = \sum_{j=1}^{N} H(B'_j | B_j) .$$

Also,

$$H(B') \le \sum_{j=1}^{N} H(B'_j)$$

with equality if and only if the $(B'_j)$ are independent.


Thus

$$I(B', B) = H(B') - H(B'|B) \le \sum_{j=1}^{N} H(B'_j) - \sum_{j=1}^{N} H(B'_j | B_j) = \sum_{j=1}^{N} I(B'_j, B_j)$$

and there is equality if and only if the $(B'_j)$ are independent. This shows that $\mathrm{Cap}(Q^N) \le N \mathrm{Cap}(Q)$.


Now choose the distribution of $B_j$ so that $\mathrm{Cap}(Q) = I(B'_j, B_j)$ and the $(B_j)$ are independent. Then
the $(B'_j)$ are also independent and

$$I(B', B) = N \mathrm{Cap}(Q)$$

so the product channel $Q^N$ has capacity $N \mathrm{Cap}(Q)$.


Exercise:
Use a similar argument to show that the capacity of independent channels $Q_j$ in parallel is $\sum_j \mathrm{Cap}(Q_j)$.


Suppose that we send a random variable $X$ through a channel and receive $Y$ with mutual information
$I(X,Y)$. If we apply a function $f$ to $Y$ then it is intuitively clear that the information transmitted
from $X$ to $f(Y)$ is no larger than that transmitted from $X$ to $Y$. The next Proposition proves this
rigorously.


Proposition 8.4 Functions do not increase mutual information
Let $X, Y$ be random variables and $f$ a function defined on the values taken by $Y$, then

$$I(X, f(Y)) \le I(X, Y) .$$

Similarly, $I(f(Y), X) \le I(Y, X)$.


Proof:
We know from Corollary 2.2 that $H(A,B) \le H(A) + H(B)$. Applying this to the random
variables conditioned on the value of $Z$ gives

$$H(X,Y|Z) \le H(X|Z) + H(Y|Z) .$$

This is equivalent to

$$\begin{aligned} & H(X,Y,Z) - H(Z) \le H(X,Z) - H(Z) + H(Y,Z) - H(Z) \\ \Leftrightarrow\ & H(X) + H(Z) - H(X,Z) \le H(X) + H(Y,Z) - H(X,Y,Z) \\ \Leftrightarrow\ & I(X,Z) \le I(X,(Y,Z)) . \end{aligned}$$

If we set $Z = f(Y)$, then $H(Y,Z) = H(Y)$ and $H(X,Y,Z) = H(X,Y)$ so $I(X,(Y,Z)) = I(X,Y)$. Thus
we have

$$I(X, f(Y)) \le I(X,Y)$$

as required.


Since $I(X,Y)$ is symmetric (Proposition 8.1), we also have $I(f(Y), X) \le I(Y, X)$.


Exercise: 

Data Processing Inequality 

Consider two independent channels in series. A random variable $X$ is sent through channel 1 and
received as $Y$. This is then sent through channel 2 and received as $Z$. Prove that $I(X,Z) \le I(X,Y)$,
so the further processing of the second channel can only reduce the mutual information. (The last
proposition established this when the second channel has no noise.)


The independence of the channels means that, if we condition on the value of $Y$, then $(X|Y = y)$
and $(Z|Y = y)$ are independent. Deduce that

$$H(X,Z|Y) = H(X|Y) + H(Z|Y) .$$

By writing the conditional entropies as $H(A|B) = H(A,B) - H(B)$, show that

$$H(X,Y,Z) + H(Y) = H(X,Y) + H(Y,Z) .$$

Define $I(X,Y|Z)$ as $H(X|Z) + H(Y|Z) - H(X,Y|Z)$ and show that

$$I(X,Y|Z) = I(X,Y) - I(X,Z) .$$

Deduce from this the data processing inequality:

$$I(X,Z) \le I(X,Y) .$$


When is there equality? 
9: FANO’S INEQUALITY


9.1 A Model for Noisy Coding 


We will, eventually, prove Shannon’s Coding Theorem which shows that we can find codes that 
transmit messages through a noisy channel with small probability of error and at a rate close to the 
information capacity of the channel. These are very good codes. Unfortunately, Shannon’s argument is 
not constructive and it has proved very difficult to construct such good codes. In the remainder of this 
section we will show that the rate at which a code transmits information can not exceed the information 
capacity of the channel without the probability of error being large. 


The situation we face is as follows. We wish to transmit a message written in the alphabet $\mathcal{A}$ of
size $K$. Let $A$ be a random variable that takes values in $\mathcal{A}$. Then $H(A) \le \log_2 K$ with equality when
all $K$ letters in $\mathcal{A}$ are equally likely. We will assume that this is the case.


We choose a constant length code $c : \mathcal{A} \to \mathcal{B}^N$ to encode the message, one letter at a time. This
gives a random variable $B = c(A)$ taking values in $\mathcal{B}^N$. Since $c$ is assumed to be invertible, $A$ determines
$B$ and vice versa. Therefore $H(A) = H(B)$.


Each bit of $c(a)$ is then passed through a channel where errors may arise. This results in a new
string $c(a)' \in \mathcal{B}^N$ and we set $B' = c(A)'$. The closest code word to $c(a)'$ will be denoted by $c(a')$ for
some $a' \in \mathcal{A}$ and we then decode it as $a'$. The random variable $A'$ results from decoding $B'$. Since $A'$
is a function of $B'$, we certainly have $H(A') \le H(B')$.


The probability of error when the letter $a$ is sent is

$$P(\text{error} \mid a \text{ sent})$$

and the total probability of error is

$$P(\text{error}) = \sum_{a \in \mathcal{A}} P(\text{error} \mid a \text{ sent}) P(A = a) .$$

We would like this to be small. Indeed, we would really like the probability to be small over all choices
of $a \in \mathcal{A}$. So we want the maximum error

$$\hat{e}(c) = \max \{ P(\text{error} \mid a \text{ sent}) : a \in \mathcal{A} \}$$

to be small.


The rate of transmission for $c$ is

$$\rho(c) = \frac{H(A)}{N} = \frac{\log_2 K}{N} .$$

We want this to be as large as possible. Our aim is to relate this to the information capacity of the
channel.


                   SENDER                 CHANNEL                RECEIVER
                   A  ----->  B^N  ------------------->  B^N  ----->  A
Random variables   A          B             Cap          B'           A'
Entropy            H(A)   =   H(B)                       H(B')   >=   H(A')
                   log_2 K


Note that the particular code $c : \mathcal{A} \to \mathcal{B}^N$ does not really matter. We can permute the code
words $\{c(a) : a \in \mathcal{A}\}$ in any way we wish and not alter the rate of transmission or the maximum error
probability. So, the set of code words is really more important than the actual alphabet or code we
choose. We sometimes refer to the set of code words $\{c(a) : a \in \mathcal{A}\}$ as the code book.
9.2 Fano’s Inequality


Lemma 9.1
Let $X$ be a random variable that takes values in a finite alphabet $\mathcal{A}$ of size $K$. Then, for any fixed letter
$a \in \mathcal{A}$, we have

$$H(X) \le p \log_2 (K - 1) + h(p)$$

where $p = P(X \ne a)$.
Proof:
Recall that

$$H(X, I) = H(X|I) + H(I) .$$

Let $I$ be the indicator random variable

$$I = \begin{cases} 1 & \text{when } X \ne a; \\ 0 & \text{when } X = a. \end{cases}$$

Then we see that:

$$H(X, I) = H(X)$$

because $I$ is determined by the values of $X$.

$$\begin{aligned} H(X|I) &= P(I = 0) H(X|I = 0) + P(I = 1) H(X|I = 1) \\ &= (1 - p) \cdot 0 + p H(X|I = 1) \\ &\le p \log_2 (K - 1) \end{aligned}$$

because, if $I = 0$ then $X = a$ and so $H(X|I = 0) = 0$, while if $I = 1$, there are only $K - 1$ possible
values for $X$.


Therefore, we have

$$H(X) = H(X|I) + H(I) \le p \log_2 (K - 1) + h(p)$$

because $H(I) = H(1 - p, p) = h(p)$.


Theorem 9.2 Fano’s inequality
Let $X, Y$ be two random variables that take values in a finite alphabet $\mathcal{A}$ of size $K$. Then

$$H(X|Y) \le p \log_2 (K - 1) + h(p)$$

where $p = P(X \ne Y)$.


We will apply this where $X$ takes values in the alphabet $\mathcal{A}$ and $Y$ is the result of passing the code word
$c(X)$ through a channel and decoding it. The probability $p$ is then the probability of error.


Proof:
Condition the inequality in the Lemma on the value of $Y$ to obtain

$$H(X|Y = y) \le P(X \ne y \mid Y = y) \log_2 (K - 1) + H(I|Y = y)$$

where $I$ is given by

$$I = \begin{cases} 1 & \text{when } X \ne Y; \\ 0 & \text{when } X = Y. \end{cases}$$

If we multiply this by $P(Y = y)$ and sum over all values for $y$ we obtain

$$H(X|Y) \le P(X \ne Y) \log_2 (K - 1) + H(I|Y) = p \log_2 (K - 1) + H(I|Y) .$$

Finally, recall that $H(I|Y) = H(I,Y) - H(Y) \le H(I)$ by Corollary 2.2, and $H(I) = h(p)$.
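

Fano’s inequality can be checked numerically on any joint distribution of $X$ and $Y$. The following sketch draws a few random joint distributions on a $K \times K$ table and compares the two sides; the random test data, the seed and the function names are assumptions made purely for illustration.

```python
import random
from math import log2


def entropy(probs):
    """Entropy in bits of a finite probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def fano_sides(joint):
    """Return (H(X|Y), p*log2(K-1) + h(p)) for joint[x][y] = P(X = x, Y = y)."""
    k = len(joint)
    p_y = [sum(joint[x][y] for x in range(k)) for y in range(k)]
    h_joint = entropy([joint[x][y] for x in range(k) for y in range(k)])
    h_x_given_y = h_joint - entropy(p_y)          # H(X|Y) = H(X,Y) - H(Y)
    # p = P(X != Y), clamped to [0, 1] to absorb rounding error
    p_err = min(1.0, max(0.0, 1.0 - sum(joint[x][x] for x in range(k))))
    bound = p_err * log2(k - 1) + entropy([p_err, 1.0 - p_err])
    return h_x_given_y, bound


def random_joint(k, rng):
    """A random joint distribution on a k-by-k table (illustrative test data)."""
    weights = [[rng.random() for _ in range(k)] for _ in range(k)]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


rng = random.Random(0)
for k in (2, 3, 5):
    lhs, rhs = fano_sides(random_joint(k, rng))
    print(k, lhs <= rhs + 1e-12)   # the inequality holds in each case
```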
We can think of the two terms of Fano’s inequality as $h(p) = H(I)$ measuring the information given
by knowing whether there is an error and $p \log_2 (K - 1)$ measuring the information on what that error
is, when it occurs.


We can use Fano’s inequality to compare the rate of transmission of information to the capacity of
the channel. As always we will use our model for a noisy channel. A random letter $A$ is chosen from
$\mathcal{A}$ and encoded as the $\mathcal{B}^N$-valued random variable $B = c(A)$. This is transmitted, one letter at a time,
through the noisy channel and received as the $\mathcal{B}^N$-valued random variable $B'$ which is decoded to give
$A'$. The rate of transmission is $\rho(c) = \log_2 K / N$.


Choose $A$ so that each letter in $\mathcal{A}$ has probability $1/K$. Fano’s inequality gives

$$H(A|A') \le h(p) + p \log_2 (K - 1) \le h(p) + p \log_2 K .$$

So we have

$$I(A', A) = H(A) - H(A|A') \ge \log_2 K - (h(p) + p \log_2 K) .$$

Now the code $c : \mathcal{A} \to \mathcal{B}^N$ is injective, so $A$ is a function of $B$. Therefore Proposition 8.4 shows that

$$I(A', A) \le I(A', B) .$$

Similarly, $A'$ is a function of $B'$, namely the decoding function. So

$$I(A', B) \le I(B', B) .$$

Therefore, $I(A', A) \le I(B', B) \le N \mathrm{Cap}$ and hence

$$N \mathrm{Cap} \ge (1 - p) \log_2 K - h(p) .$$

Dividing by $N$ we obtain

$$\mathrm{Cap} + \frac{h(p)}{N} \ge (1 - p) \rho(c) .$$

Theorem 9.3 Capacity bounds the rate of transmission of information 
Let c: A + B% where each letter of a code word is transmitted through a time-independent memo- 
ryless channel with capacity Cap. Let p be the probability of decoding incorrectly. Then the rate of 
transmission satisfies 
atax logy K g Cap h(p) 
N “1-p NQ-p) 


As we let the probability of error decrease to 0, $1 - p \nearrow 1$ and $h(p) \searrow 0$. Therefore
$$\frac{\mathrm{Cap}}{1 - p} + \frac{h(p)}{N(1 - p)} \searrow \mathrm{Cap} .$$


Hence, if we find codes $c_j$ with probabilities of error $p_j$ that decrease to 0, and rates of transmission that
tend to a limit $\rho$, then $\rho$ must be at most the information capacity of the channel.
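
The bound of Theorem 9.3 is easy to evaluate numerically. The following is a minimal sketch, not part of the original notes; the function names `binary_entropy` and `rate_upper_bound` are illustrative assumptions. It shows the right-hand side of the bound approaching Cap as the error probability shrinks.

```python
import math


def binary_entropy(p):
    """h(p) = -p log2 p - (1 - p) log2 (1 - p), with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def rate_upper_bound(cap, p, n):
    """Right-hand side of Theorem 9.3: Cap/(1 - p) + h(p)/(N(1 - p))."""
    return cap / (1 - p) + binary_entropy(p) / (n * (1 - p))


if __name__ == "__main__":
    # Hypothetical channel with capacity 0.5 and code words of length 100.
    for p in (0.1, 0.01, 0.001):
        print(p, rate_upper_bound(0.5, p, 100))
```

For a fixed length N the bound decreases towards Cap as p decreases, which is the limiting statement made above.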
10: SHANNON’S NOISY CODING THEOREM


10.1 Introduction 


In this section we will prove Shannon’s Noisy Coding Theorem, which shows that we can find codes 
with rate of transmission close to the capacity of a channel and small probability of error. We will do
this only in the case of a binary symmetric channel. The arguments are simpler in this case but the 
principles apply more generally. 


Theorem 10.1  Shannon’s Noisy Coding Theorem 

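For $\varepsilon > 0$ and a binary symmetric channel with capacity Cap there are codes
$$c_N : \mathcal{A}_N \to \{0,1\}^N$$
for $N$ sufficiently large with rate of transmission
$$\rho(c_N) = \frac{\log_2 |\mathcal{A}_N|}{N} \ge \mathrm{Cap} - \varepsilon$$
and maximum error
$$\hat{e}(c_N) < \varepsilon .$$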


Example: 
We wish to send a message m from an alphabet A of size K through a binary symmetric channel with 
error probability p = 0.1. What rate of transmission can we achieve with small probability of error? 


$\mathrm{Cap} = 1 - h(0.1) \approx 0.53$, so Shannon’s Theorem shows that we can achieve any rate of transmission
strictly less than this with arbitrarily small probability of error. For example, suppose that we want a rate
of transmission $\rho \ge 0.5$ and $\hat{e} < 0.01$. Shannon’s Theorem shows that there are codes $c_N : \mathcal{A}_N \to \{0,1\}^N$
that achieve this provided that $N$ is large enough, say $N \ge N_o$.


Suppose that we know such a code $c_N$. How do we encode the message $m$? First divide $m$ into
blocks of length $B$ where
$$B = \left\lfloor \frac{0.5N}{\log_2 K} \right\rfloor \quad\text{so that}\quad |\mathcal{A}^B| = K^B \le 2^{0.5N} \le |\mathcal{A}_N| .$$


Then we can embed the blocks from $\mathcal{A}^B$ in the alphabet $\mathcal{A}_N$ and so encode the blocks. The rate of
transmission is
$$\frac{\log_2 |\mathcal{A}^B|}{N} \approx 0.5 .$$
Unfortunately, Shannon’s Theorem tells us that there are such codes but does not tell us how to
find them and it is very difficult to do so.
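
The numbers in this example can be checked with a few lines of code. This is a sketch only; `bsc_capacity` and `block_length` are hypothetical helper names, and the alphabet size 27 and length 1000 used below are assumed values chosen purely for illustration.

```python
import math


def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p):
    """Capacity 1 - h(p) of a binary symmetric channel with error probability p."""
    return 1.0 - binary_entropy(p)


def block_length(n, k, rate=0.5):
    """Largest B with K^B <= 2^(rate * N), i.e. B = floor(rate * N / log2 K)."""
    return math.floor(rate * n / math.log2(k))


if __name__ == "__main__":
    print(bsc_capacity(0.1))        # capacity of the channel in the example
    print(block_length(1000, 27))   # block length for a 27-letter alphabet, N = 1000
```

With these assumed values, blocks of B letters from the 27-letter alphabet fit inside any alphabet of at least 2^{0.5N} letters, and the resulting rate B log2(K)/N is just below 0.5.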


10.2 Chebyshev’s Inequality 


Theorem 10.2 Chebyshev’s inequality 
Let $X$ be a real-valued, discrete random variable with mean $\mathbb{E}X$ and variance $\mathrm{var}(X)$. Then, for $t > 0$,
$$P(|X - \mathbb{E}X| \ge t) \le \frac{\mathrm{var}(X)}{t^2} .$$
(The inequality is true whenever the variance of $X$ exists, even if $X$ is not discrete.)


Proof:
We have
$$\mathrm{var}(X) = \mathbb{E}|X - \mathbb{E}X|^2 = \sum_x P(X = x)\,|x - \mathbb{E}X|^2 \ge \sum \{P(X = x)\,t^2 : |x - \mathbb{E}X| \ge t\} = t^2\,P(|X - \mathbb{E}X| \ge t) .$$


So we obtain the result.
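
As a sanity check, Chebyshev’s inequality can be compared with an exact tail probability. The sketch below does this for a binomial random variable, which is the case used in the next section; the function names are illustrative and the parameters (100 trials, success probability 0.1) are assumptions for the demonstration.

```python
from math import comb


def binomial_pmf(n, p, k):
    """P(X = k) for X binomial with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def tail_probability(n, p, t):
    """Exact value of P(|X - EX| >= t) for X binomial B(n, p)."""
    mean = n * p
    return sum(binomial_pmf(n, p, k) for k in range(n + 1) if abs(k - mean) >= t)


def chebyshev_bound(n, p, t):
    """var(X) / t^2 with var(X) = n p (1 - p)."""
    return n * p * (1 - p) / t ** 2


if __name__ == "__main__":
    for t in (4, 6, 10):
        print(t, tail_probability(100, 0.1, t), chebyshev_bound(100, 0.1, t))
```

The exact tail is typically much smaller than the Chebyshev bound; the inequality is crude, but it is all that the proof of the noisy coding theorem needs.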


10.3 Proof of the Noisy Coding Theorem 


To prove Shannon’s Theorem (Theorem 10.1) we need to find code words in $\{0,1\}^N$. Choose these
at random and independently of one another and of the channel. We will prove that, on average, this
gives the inequalities we want. Then there must be at least some of the choices for code words that also
give the inequalities.


Let the binary symmetric channel have error probability $p$ with $0 < p < \frac{1}{2}$. Then $\mathrm{Cap} = 1 - h(p)$.
Set
$$K_N = 2^{N(\mathrm{Cap} - \varepsilon)}$$
and let $\mathcal{A}_N$ be an alphabet with $K_N$ letters.


Choose a letter $a_o \in \mathcal{A}_N$ and then choose a random code word as $c_N(a_o)$ uniformly from the $2^N$
words in $\{0,1\}^N$. These choices must be independent for each different letter and also independent of
the channel. When a fixed word $c_o = c_N(a_o)$ is sent through the channel, it is corrupted and a new
word $c'_o$ is received. We want this received word to be close to $c_o$.


The Hamming distance $d(c_o, c'_o)$ is a binomial $B(N, p)$ random variable, so it has mean $Np$ and
variance $Np(1 - p)$. Therefore Chebyshev’s inequality gives
$$P(d(c_o, c'_o) \ge r) \le \frac{Np(1 - p)}{(r - Np)^2}$$
for $r > Np$. Take $q > p$ and set $r = Nq$, then
$$P(d(c_o, c'_o) \ge r) \le \frac{p(1 - p)}{N(q - p)^2} \qquad (1).$$


Now consider another code word $c = c_N(a)$ for a letter $a \ne a_o$. This is chosen uniformly and
randomly from the $2^N$ words in $\{0,1\}^N$, so Proposition 7.8 shows that
$$P(d(c, c'_o) < r) = P(c \in B(c'_o, r)) = \frac{V(N, r)}{2^N} \le \frac{2^{Nh(q)}}{2^N} = 2^{-N(1 - h(q))} .$$


The probability that $d(c, c'_o) < r$ for at least one of the $K_N - 1$ code words $c$ not equal to $c_o$ is therefore
at most
$$K_N 2^{-N(1 - h(q))} = 2^{N(\mathrm{Cap} - \varepsilon - (1 - h(q)))} .$$
Now $\mathrm{Cap} = 1 - h(p)$, so this gives
$$P(\text{there exists } a \ne a_o \text{ with } d(c_N(a), c'_o) < r) \le 2^{N(h(q) - h(p) - \varepsilon)} \qquad (2).$$

We will use the minimum distance decoding rule. So we make an error and decode $c_o = c_N(a_o)$ as
another letter $a \ne a_o$ only when either $d(c_o, c'_o) \ge r$ or $d(c_N(a), c'_o) < r$ for some $a \ne a_o$. Hence the
probability of an error is
$$P(\text{error}) \le \frac{p(1 - p)}{N(q - p)^2} + 2^{N(h(q) - h(p) - \varepsilon)}$$
because of (1) and (2). Choose $q$ with
$$p < q < \tfrac{1}{2} \quad\text{and}\quad h(q) < h(p) + \varepsilon .$$
This is certainly possible for $0 < p < \frac{1}{2}$. Then
$$\frac{p(1 - p)}{N(q - p)^2} < \tfrac{1}{2}\varepsilon \quad\text{and}\quad 2^{N(h(q) - h(p) - \varepsilon)} < \tfrac{1}{2}\varepsilon$$
for all sufficiently large $N$. For such $N$ we see that the probability of error is strictly less than $\varepsilon$.
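
To get a feel for how large N must be, the two terms of the error bound can be evaluated directly. This is a rough sketch rather than part of the proof; `error_bound` and `smallest_block_length` are hypothetical names, and the values p = 0.1, q = 0.11 and eps = 0.05 are assumptions chosen so that p < q < 1/2 and h(q) < h(p) + eps.

```python
import math


def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def error_bound(n, p, q, eps):
    """Sum of the Chebyshev term (1) and the random-coding term (2)."""
    if not (p < q < 0.5 and binary_entropy(q) < binary_entropy(p) + eps):
        raise ValueError("need p < q < 1/2 and h(q) < h(p) + eps")
    chebyshev_term = p * (1 - p) / (n * (q - p) ** 2)
    coding_term = 2.0 ** (n * (binary_entropy(q) - binary_entropy(p) - eps))
    return chebyshev_term + coding_term


def smallest_block_length(p, q, eps, limit=10 ** 6):
    """Smallest N <= limit at which the bound drops below eps, or None."""
    for n in range(1, limit + 1):
        if error_bound(n, p, q, eps) < eps:
            return n
    return None


if __name__ == "__main__":
    print(smallest_block_length(0.1, 0.11, 0.05))
```

Both terms decrease as N grows, so the bound stays below eps for every larger N. The block lengths this crude bound demands are far larger than practical codes need.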


This is the error averaged over all of the random choices of code words. This means that there 
must be at least one choice of the code words that also has $P(\text{error}) < \varepsilon$. Take this as the code $c_N$.


We have almost proved the result we sought, except that we have bounded the mean error
$P(\text{error})$ rather than the maximum error $\hat{e}(c_N) = \max\{P(\text{error} \mid a \text{ sent}) : a \in \mathcal{A}_N\}$.


For our code $c_N$ the average error probability satisfies
$$\frac{1}{|\mathcal{A}_N|} \sum_{a \in \mathcal{A}_N} P(\text{error} \mid a \text{ sent}) < \varepsilon$$
so, for at least half of the letters in $\mathcal{A}_N$ we must have
$$P(\text{error} \mid a \text{ sent}) < 2\varepsilon .$$
Choose a new alphabet $\mathcal{A}'_N$ that consists just of these $\frac{1}{2}K_N$ letters. Then we certainly have
$$P(\text{error} \mid a \text{ sent}) < 2\varepsilon \quad\text{for } a \in \mathcal{A}'_N$$
so the maximum error is at most $2\varepsilon$. Furthermore,
$$|\mathcal{A}'_N| = \tfrac{1}{2}|\mathcal{A}_N| = 2^{N(\mathrm{Cap} - \varepsilon) - 1}$$
so the rate of transmission is
$$\frac{\log_2 |\mathcal{A}'_N|}{N} = \mathrm{Cap} - \varepsilon - \frac{1}{N} .$$

10.4 * Typical Sequences *


In this section we want to reinterpret the entropy H(A) in terms of the probability of a “typical 
sequence” of values arising from a sequence of independent random variables $(A_n)$ all with the same
distribution as $A$.


As a motivating example, consider tossing an unfair coin with probability $p$ of heads. Let $A, (A_n)$
be the results of independent coin tosses. If we toss this coin a large number $N$ times, we expect to get
approximately $pN$ heads and $(1 - p)N$ tails. The probability of any particular sequence of $pN$ heads
and $(1 - p)N$ tails is
$$p^{pN}(1 - p)^{(1 - p)N} = 2^{N(p\log_2 p + (1 - p)\log_2(1 - p))} = 2^{-NH(A)} .$$


Of course not every sequence of coin tosses will be like this. It is quite possible to get extraordinary
sequences like all $N$ heads. However, there is only a small probability of getting such an atypical
sequence. With high probability we will get a typical sequence and its probability will be close to
$2^{-NH(A)}$.


To make this precise we need to recall the weak law of large numbers from the probability course. 
This follows from Chebyshev’s inequality. 


We say that a sequence of random variables $S_n$ converges to a random variable $L$ in probability if
$$P(|S_n - L| \ge \varepsilon) \to 0 \quad\text{as } n \to \infty$$
for every $\varepsilon > 0$. This means that $S_n$ and $L$ can still take very different values for large $n$ but only on a
set with small probability.


Theorem 10.3. The weak law of large numbers 
Let $X$ be a discrete real-valued random variable. Let $(X_n)$ be independent random variables each with
the same distribution as $X$. The averages
$$S_n = \frac{X_1 + X_2 + \ldots + X_n}{n}$$
then converge in probability to the constant $\mathbb{E}X$ as $n \to \infty$.


Proof:
We need to show that
$$P(|S_n - \mathbb{E}X| \ge t) \to 0 \quad\text{as } n \to \infty .$$
Note that
$$\mathbb{E}S_n = \frac{\mathbb{E}X_1 + \mathbb{E}X_2 + \ldots + \mathbb{E}X_n}{n} = \mathbb{E}X$$
and
$$\mathrm{var}(S_n) = \frac{\mathrm{var}(X_1) + \mathrm{var}(X_2) + \ldots + \mathrm{var}(X_n)}{n^2} = \frac{\mathrm{var}(X)}{n} .$$
Chebyshev’s inequality gives
$$P(|S_n - \mathbb{E}S_n| \ge t) \le \frac{\mathrm{var}(S_n)}{t^2}$$
so we have
$$P(|S_n - \mathbb{E}X| \ge t) \le \frac{\mathrm{var}(X)}{nt^2}$$
which certainly tends to 0 as $n \to \infty$.

Let $A$ be a random variable that takes values in the finite set $\mathcal{A}$. Let $(A_n)$ be a sequence of
independent random variables each with the same distribution as $A$. Recall that the entropy of $A$ is
$$H(A) = -\sum_{a \in \mathcal{A}} P(A = a)\log_2 P(A = a) .$$
This is the expectation of the real-valued random variable:
$$X(\omega) = -\log_2 P(A = A(\omega)) .$$


(In other words, if $A$ takes the values $a_k$ with probabilities $p_k$, then $X$ takes the values $-\log_2 p_k$ with the
same probabilities.) In a similar way we define $X_n$ from $A_n$, so the random variables $X_n$ are independent
and each has the same distribution as $X$.


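The weak law of large numbers (Theorem 10.3) shows that
$$\frac{X_1 + X_2 + \ldots + X_n}{n} \to \mathbb{E}X \quad\text{in probability as } n \to \infty .$$


Hence, for any $\varepsilon > 0$, there is an $N(\varepsilon)$ with
$$P\left( \left| \frac{X_1 + X_2 + \ldots + X_n}{n} - \mathbb{E}X \right| \ge \varepsilon \right) \le \varepsilon$$
for $n \ge N(\varepsilon)$.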


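Consider a particular sequence of values $(a_j)_{j=1}^n$ from $\mathcal{A}$. The probability of obtaining this sequence
as $(A_j)$ is
$$P(A_j = a_j \text{ for } j = 1,2,\ldots,n) = \prod_{j=1}^n P(A = a_j) .$$
Also, if we do have $A_j = a_j$, then
$$\frac{X_1 + X_2 + \ldots + X_n}{n} = -\frac{1}{n}\sum_{j=1}^n \log_2 P(A = a_j) = -\frac{1}{n}\log_2 \prod_{j=1}^n P(A = a_j)$$
and therefore, since $\mathbb{E}X = H(A)$,
$$\left| \frac{X_1 + X_2 + \ldots + X_n}{n} - \mathbb{E}X \right| < \varepsilon$$
is equivalent to
$$2^{-n(H(A) + \varepsilon)} < P(A_j = a_j \text{ for } j = 1,2,\ldots,n) < 2^{-n(H(A) - \varepsilon)} \qquad (*).$$


Thus we see that the probability of obtaining a sequence $(a_j)$ for which (*) holds is at least $1 - \varepsilon$ for
$n \ge N(\varepsilon)$. These are the “typical sequences”. Thus we have proved: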


Theorem 10.4 Asymptotic equipartition property 

Let $A$ be a random variable taking values in the finite set $\mathcal{A}$ and with entropy $H(A)$. Let $(A_n)$ be a
sequence of independent random variables each with the same distribution as $A$. For each $\varepsilon > 0$, there
is a natural number $N(\varepsilon)$ such that the following conditions hold for any $n \ge N(\varepsilon)$.


There is a set $T$ of sequences in $\mathcal{A}^n$ with
$$P((A_j)_{j=1}^n \in T) \ge 1 - \varepsilon$$
and, for each sequence $(a_j) \in T$,
$$2^{-n(H(A) + \varepsilon)} < P(A_j = a_j \text{ for } j = 1,2,\ldots,n) < 2^{-n(H(A) - \varepsilon)} \qquad (*).$$


We call the sequences in $T$ the “typical ones” while those outside are “atypical”. The sequences
in $T$ all have approximately the same probability $2^{-nH(A)}$ of arising, so we say that the sequence of
random variables $(A_n)$ has the asymptotic equipartition property.
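
For the unfair coin of the motivating example, the probability of the typical set can be computed exactly, because the probability of a sequence depends only on its number of heads. The sketch below is an illustration under that assumption; `typical_set_probability` is a hypothetical name, and the values p = 0.2 and eps = 0.05 are chosen for the demonstration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a coin with probability p of heads (0 < p < 1)."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def typical_set_probability(p, n, eps):
    """Total probability of the sequences of n tosses whose probability lies
    strictly between 2^(-n(H + eps)) and 2^(-n(H - eps))."""
    entropy = binary_entropy(p)
    total = 0.0
    for heads in range(n + 1):
        # log2 of the probability of one particular sequence with this many heads
        log2_prob = heads * math.log2(p) + (n - heads) * math.log2(1 - p)
        if abs(-log2_prob / n - entropy) < eps:
            # work in logarithms: comb(n, heads) is too large to convert to a float
            log2_count = math.log2(math.comb(n, heads))
            total += 2.0 ** (log2_count + log2_prob)
    return total


if __name__ == "__main__":
    for n in (50, 500, 2000):
        print(n, typical_set_probability(0.2, n, 0.05))
```

The printed probabilities increase with n, as the theorem predicts: for small n much of the probability lies outside the typical set, while for large n almost all of it lies inside.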
11: LINEAR CODES


It is useful to use the algebra you have learnt in earlier courses to describe codes. In particular, if 
we are dealing with an alphabet of size $q = p^k$ for some prime number $p$, then there is a field $\mathbb{F}_q$ with
exactly $q$ elements and we can use this to represent the alphabet. The most important case is when
$q = 2$, or a power of 2, and we have seen this already when we looked at Hamming’s code.


11.1 Finite Fields 


In this section we will recall results from the Numbers and Sets, Linear Algebra, and Groups, Rings 
and Modules courses. For this course, it will be sufficient to use only the integers modulo 2, $\mathbb{F}_2$, or
occasionally the fields $\mathbb{F}_{2^K}$ for $K > 1$.


There are fields with a finite number $q$ of elements if and only if $q$ is a prime power $q = p^r$. These
fields are unique, up to isomorphism, and will be denoted by $\mathbb{F}_q$.


Example:
For each prime number $p$, the integers modulo $p$ form a field. So $\mathbb{F}_p = \mathbb{Z}/p\mathbb{Z}$.


The non-zero elements of a finite field $\mathbb{F}_q$ form a commutative group denoted by $\mathbb{F}_q^\times$. This is a cyclic
group of order $q - 1$. Any generator of $\mathbb{F}_q^\times$ is called a primitive element for $\mathbb{F}_q$. Let $\alpha$ be a primitive
element. Then the elements of the group $\mathbb{F}_q^\times$ are the powers $\alpha^k$ for $k = 1,2,3,\ldots,q-1$. These
are all distinct. The order of the element $\alpha^k$ is
$$\frac{q - 1}{(q - 1, k)} ,$$
so the number of primitive elements is given by Euler’s totient function
$$\varphi(q - 1) = |\{k : k = 1,2,3,\ldots,q-1 \text{ and } (q - 1, k) = 1\}| .$$


Example: 
For the integers modulo 7 we have 3 and 5 as the only primitive elements. 
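
The primitive elements of the integers modulo a prime can be found by brute force. This is a small sketch for prime moduli only (so that the non-zero residues form a group); the function names are illustrative.

```python
def multiplicative_order(a, p):
    """Order of a in the multiplicative group of integers modulo the prime p."""
    order, power = 1, a % p
    while power != 1:
        power = (power * a) % p
        order += 1
    return order


def primitive_elements(p):
    """All generators of the multiplicative group modulo the prime p."""
    return [a for a in range(1, p) if multiplicative_order(a, p) == p - 1]


if __name__ == "__main__":
    print(primitive_elements(7))
    print(primitive_elements(11))
```

The number of primitive elements found is the value of Euler’s totient function at p - 1, in agreement with the count given above.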


We will want to work with alphabets that are finite dimensional vector spaces over the finite fields
$\mathbb{F}_q$. For example, $\mathbb{F}_q^N$ is an $N$-dimensional vector space over $\mathbb{F}_q$. Any $N$-dimensional vector space over $\mathbb{F}_q$ has
$q^N$ vectors. The scalar product on $\mathbb{F}_q^N$ is given by
$$x \cdot y = \sum_{n=1}^N x_n y_n .$$


(Every linear map $A : \mathbb{F}_q^N \to \mathbb{F}_q$ is of the form $x \mapsto x \cdot y$ for some $y \in \mathbb{F}_q^N$. So the scalar product
gives us a way to identify the dual space $(\mathbb{F}_q^N)^*$ with $\mathbb{F}_q^N$.) For any finite set $S$, the set of all functions
$V = \mathbb{F}_q^S = \{f : S \to \mathbb{F}_q\}$ is a vector space over $\mathbb{F}_q$ with dimension $|S|$.


A polynomial over $\mathbb{F}_q$ is a formal expression:
$$A(X) = a_0 + a_1 X + a_2 X^2 + \ldots + a_D X^D$$
with the coefficients $a_n \in \mathbb{F}_q$. In this expression, $X$ is an indeterminate or formal parameter rather than
an element of the field. The degree $\deg(A)$ of the polynomial $A$ is the largest $n$ with $a_n \ne 0$. The degree
of the 0 polynomial is defined to be $-\infty$. Define addition and multiplication of polynomials as usual.
This makes the set $\mathbb{F}_q[X]$ of all polynomials over $\mathbb{F}_q$ a ring. The degree is a Euclidean function for this
ring and makes it a Euclidean domain: for polynomials $A, D \in \mathbb{F}_q[X]$ with $D \ne 0$, there are uniquely
determined polynomials $Q$ and $R$ with
$$A = Q.D + R \quad\text{and}\quad \deg(R) < \deg(D) .$$

This has many useful consequences. We have the division criterion:
$$(X - \alpha) \mid A \iff A(\alpha) = 0$$
for an element $\alpha \in \mathbb{F}_q$. Furthermore, the polynomial ring is a principal ideal domain. For, if $I$ is an ideal
in $\mathbb{F}_q[X]$ (written $I \triangleleft \mathbb{F}_q[X]$), then either $I = \{0\}$ or there is a non-zero polynomial $D \in I$ of smallest
degree. For any other polynomial $A \in I$ we can write
$$A = Q.D + R \quad\text{with}\quad \deg(R) < \deg(D) .$$
However, $R = A - Q.D \in I$, so, by the minimality of $\deg(D)$, $R$ must be the 0 polynomial. Therefore $A = Q.D$. Hence every ideal is of the form
$D\,\mathbb{F}_q[X]$ for some polynomial $D$. We will denote this ideal by $(D)$.


For any ideal $I = D\,\mathbb{F}_q[X]$, the quotient $\mathbb{F}_q[X]/I$ is a ring and there is a quotient ring homomorphism
$\mathbb{F}_q[X] \to \mathbb{F}_q[X]/I$. The quotient is a vector space over $\mathbb{F}_q$ with dimension $\deg(D)$.


A very important example of this is when $D$ is the polynomial $X^n - 1$ and we consider the quotient
$\mathbb{F}_q[X]/(X^n - 1)$. If we divide any polynomial $A \in \mathbb{F}_q[X]$ by $X^n - 1$ we obtain a remainder of degree
strictly less than $n$. So each coset $A + (X^n - 1)\mathbb{F}_q[X]$ contains a unique polynomial $R$ of degree at
most $n - 1$. Hence we can represent the quotient $\mathbb{F}_q[X]/(X^n - 1)$ by
$$\{a_0 + a_1 X + a_2 X^2 + \ldots + a_{n-1}X^{n-1} : a_0, a_1, \ldots, a_{n-1} \in \mathbb{F}_q\} .$$


Multiplication in the quotient $\mathbb{F}_q[X]/(X^n - 1)$ then corresponds to multiplying the polynomials and
reducing the result modulo $X^n - 1$, that is replacing any power $X^k$ for $k \ge n$ by $X^m$ where $k \equiv m$
(mod $n$) and $0 \le m < n$. Note that, in this case, the quotient $\mathbb{F}_q[X]/(X^n - 1)$ is a vector space of dimension $n$.
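
Multiplication in this quotient amounts to a cyclic convolution of coefficient lists. The following is a minimal sketch, assuming q is prime so that arithmetic modulo q gives the field; the function name and the list representation (lowest degree first) are illustrative choices.

```python
def multiply_mod_xn_minus_1(a, b, n, q=2):
    """Multiply two elements of F_q[X]/(X^n - 1) for a prime q.

    Elements are coefficient lists [a_0, a_1, ..., a_{n-1}]; the exponent of
    each product term is reduced modulo n, which replaces X^k by X^(k mod n).
    """
    result = [0] * n
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            result[(i + j) % n] = (result[(i + j) % n] + a_i * b_j) % q
    return result


if __name__ == "__main__":
    x = [0, 1, 0, 0, 0, 0, 0]         # X
    x6 = [0, 0, 0, 0, 0, 0, 1]        # X^6
    print(multiply_mod_xn_minus_1(x, x6, 7))   # X . X^6 = X^7, which reduces to 1
```

Multiplying by X simply rotates the coefficient list by one place, which is the observation that makes this quotient the natural setting for cyclic codes.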


11.2 Linear Codes 


Suppose that we are looking at codes $c : \mathcal{A} \to \mathcal{B}^M$. If both of the alphabets $\mathcal{A}$ and $\mathcal{B}$ have orders
that are powers of the same prime power $q$, then we can represent both as vector spaces over $\mathbb{F}_q$. So $c$
is a map $\mathbb{F}_q^K \to \mathbb{F}_q^N$ for suitable dimensions $K, N$. Such a map is much easier to specify if it is linear,
for then we need only give the images of a basis for $\mathbb{F}_q^K$. This is in itself a considerable advantage and
makes it much simpler to implement the encoding and decoding of messages.


We will write $F$ for any finite field, so $F = \mathbb{F}_q$ for some prime power $q$. A code is linear if it is an
injective linear map $c : F^K \to F^N$. The image of $c$ is then a vector subspace of $F^N$, called the code book
of $c$. Its dimension is the rank of $c$.


Example:
Hamming’s code was a linear map $\mathbb{F}_2^4 \to \mathbb{F}_2^7$.


A linear code $c : F^K \to F^N$ is given by an $N \times K$ matrix $C$ with entries in $F$. So $c : x \mapsto Cx$. Since
$c$ is injective, the matrix $C$ has nullity 0 and rank $K$. As usual, we are really only interested in the code
book. This can be specified by giving any basis for the image.


The columns of the matrix $C$ are one basis of code words for the code book. We can apply column
operations to these to reduce the matrix to echelon form. For example
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ * & 0 & 0 & 0 \\ * & 1 & 0 & 0 \\ * & * & 0 & 0 \\ * & * & 0 & 0 \\ * & * & 1 & 0 \\ * & * & * & 1 \end{pmatrix}$$
where $*$ denotes an arbitrary but unimportant value. Applying column operations to this we can obtain
a matrix:
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ * & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ * & * & 0 & 0 \\ * & * & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$


The columns of this matrix are still a basis for the code book. If we re-order the rows of this matrix,
which corresponds to re-ordering the components of $F^N$, then we obtain a matrix of the form
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ * & * & * & * \\ * & * & * & * \\ * & * & * & * \end{pmatrix}$$


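In general, when we apply these operations to our code matrix $C$ we change it to the form
$$C = \begin{pmatrix} I \\ A \end{pmatrix}$$
where $I$ is the $K \times K$ identity matrix and $A$ is an arbitrary $(N - K) \times K$ matrix. Whenever we need to
we can reduce our linear code to this standard form. When we have a matrix of this form, the code is
$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_K \end{pmatrix} \mapsto \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_K \\ \sum_j a_{1j}x_j \\ \vdots \\ \sum_j a_{(N-K)j}x_j \end{pmatrix} \quad\text{or, more concisely,}\quad x \mapsto \begin{pmatrix} x \\ Ax \end{pmatrix} .$$


So the first $K$ terms carry the information about the vector $x$ while the remainder are check digits.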


11.3. The Syndrome 


Let $c : F^K \to F^N$ be a linear code. Its image is a vector subspace of $F^N$ with dimension $K$. So we
can construct a linear map
$$s : F^N \to F^{N-K} \quad\text{with}\quad \ker(s) = \mathrm{Im}(c) .$$
Such a linear map is called a syndrome of $c$.


For example, if we write the matrix $C$ for $c$ in the standard form $C = \begin{pmatrix} I \\ A \end{pmatrix}$ then the matrix
$S = (-A \;\; I)$ satisfies
$$\ker(S) = \left\{ \begin{pmatrix} u \\ v \end{pmatrix} : -Au + v = 0 \right\} = \left\{ \begin{pmatrix} u \\ Au \end{pmatrix} : u \in F^K \right\} = \mathrm{Im}(C) .$$


So $S$ is a syndrome for the code.

The syndrome gives an easy way to check whether a vector $x \in F^N$ is a code word, for we simply
compute $Sx$ and see whether it is 0. We can also define a code by giving a syndrome, for the code book
is then $\ker(s)$.
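
The relationship between a code matrix in standard form and its syndrome matrix can be checked mechanically. The sketch below works over the integers modulo a prime q; the helper names and the particular 3 x 4 block `A` used in the demonstration are assumptions made for illustration, not the matrices of the notes.

```python
def identity(size):
    """The size x size identity matrix as a list of rows."""
    return [[int(i == j) for j in range(size)] for i in range(size)]


def standard_form(a_rows, q=2):
    """From the (N-K) x K block A build C = (I; A) and S = (-A  I) modulo q."""
    k = len(a_rows[0])
    c = identity(k) + [[x % q for x in row] for row in a_rows]
    s = [[(-x) % q for x in row] + unit
         for row, unit in zip(a_rows, identity(len(a_rows)))]
    return c, s


def mat_mul(x, y, q=2):
    """Matrix product modulo q; matrices are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) % q for col in zip(*y)]
            for row in x]


def encode(c, message, q=2):
    """Code word C x for the message vector x."""
    return [sum(a * b for a, b in zip(row, message)) % q for row in c]


if __name__ == "__main__":
    A = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]   # hypothetical check block
    C, S = standard_form(A)
    print(mat_mul(S, C))              # the zero matrix, so Im(C) lies in ker(S)
    print(encode(C, [1, 0, 1, 1]))    # message digits followed by check digits
```

Since S has rank N - K, its kernel has dimension K, so the inclusion of Im(C) in ker(S) shown by the zero product is in fact an equality.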


Exercise: 
For a natural number $d$ there are $N = 2^d - 1$ non-zero vectors in $\mathbb{F}_2^d$. Take these as the columns of a
$d \times N$ matrix $S$. The kernel of $S$ is then the code book for a Hamming code
$$c : \mathbb{F}_2^{N-d} \to \mathbb{F}_2^N .$$
Check that the case $d = 3$ gives the Hamming code we constructed earlier.


Show that each of these Hamming codes is a perfect 1-error correcting code.


(The code $c : F^K \to F^N$ and the syndrome $s : F^N \to F^{N-K}$ are really dual to one another. For the
dual maps are
$$s^* : F^{N-K} \to F^N \quad\text{and}\quad c^* : F^N \to F^K$$
with $\ker(c^*) = \mathrm{Im}(c)^\perp = \ker(s)^\perp = \mathrm{Im}(s^*)$. So $s^*$ is a linear code with $c^*$ as its syndrome. These
correspond to transposing the matrices. So $S^T$ is the matrix for a linear code with $C^T$ its syndrome.)


The minimum distance for a code is the minimum Hamming distance between two different code 
words. When the code is linear we have $d(c', c) = d(0, c - c')$. So the minimum distance is


min {d(0,c) : c is a non-zero code word } . 


The weight of a code word x is the Hamming distance of x from 0. So the minimum distance for a 
linear code is the minimum weight for non-zero code words. 


We will use minimum distance decoding. Suppose that we receive a word $x \in F^N$. We compute its
syndrome $s(x)$. If this is 0 then we know that $x$ is a code word. Otherwise there have been errors in
transmission and we need to find the code word closest to $x$.


We proceed as follows. For each vector $y \in F^{N-K}$, find a vector $z \in s^{-1}(y)$ with minimum weight
and call this $u(y)$. Note that we can do this for every vector $y \in F^{N-K}$ in advance of receiving any
message, at least when $F^{N-K}$ is not too large. It is clear that $s(u(y)) = y$.


Now we decode the received vector $x$ as $c_o = x - u(s(x))$. Since
$$s(c_o) = s(x) - s(u(s(x))) = s(x) - s(x) = 0$$
we see that $c_o$ is indeed a code word. For any other code word $c$, we have $d(c, x) = d(0, x - c)$. However,
$s(x - c) = s(x)$, so the definition of $u$ ensures that
$$d(0, x - c) \ge d(0, u(s(x))) = d(0, x - c_o)$$
and hence that
$$d(c, x) \ge d(c_o, x) .$$


So $c_o$ is the code word closest to $x$.
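
The decoding procedure just described can be sketched for a binary Hamming code of length 7. The syndrome matrix below, whose columns are the non-zero vectors of length 3 in one particular order, and all the function names are assumptions made for this illustration.

```python
from itertools import product

# Syndrome matrix whose columns are the seven non-zero binary vectors of length 3.
S = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]


def syndrome(x):
    """s(x) = Sx over the integers modulo 2, returned as a tuple."""
    return tuple(sum(s * x_i for s, x_i in zip(row, x)) % 2 for row in S)


def build_minimum_weight_table(n):
    """u(y): for each syndrome y, a vector z of minimum weight with s(z) = y."""
    table = {}
    # Visit the vectors in order of increasing weight; keep the first for each syndrome.
    for z in sorted(product((0, 1), repeat=n), key=sum):
        table.setdefault(syndrome(z), z)
    return table


U = build_minimum_weight_table(7)


def decode(x):
    """Minimum distance decoding: c_o = x - u(s(x))."""
    u = U[syndrome(x)]
    return tuple((x_i - u_i) % 2 for x_i, u_i in zip(x, u))


if __name__ == "__main__":
    sent = (1, 1, 1, 0, 0, 0, 0)          # a code word: its syndrome is zero
    received = (1, 1, 1, 0, 1, 0, 0)      # one digit corrupted
    print(decode(received) == sent)
```

The table `U` is computed once, in advance of receiving any message, exactly as the text suggests; decoding a received word then costs one syndrome computation and one lookup.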


11.4 Reed-Muller Codes


In this section we will construct linear codes that have proved useful in dealing with very noisy 
channels. 


Let $S$ be the finite dimensional vector space $\mathbb{F}_2^M$ over the field $\mathbb{F}_2$ of integers modulo 2. This is
itself a finite set with $2^M$ elements. Let $V$ be the set of all functions from $S$ to $\mathbb{F}_2$:
$$V = \mathbb{F}_2^S = \{f : S \to \mathbb{F}_2\} .$$


The functions
$$1_y(x) = \begin{cases} 1 & \text{when } x = y; \\ 0 & \text{otherwise} \end{cases}$$
are the standard basis for $V$, so $V$ has dimension $2^M$. We will use this basis to determine the Hamming
distance in $V$, so it is
$$d(f, g) = |\{x \in S : f(x) \ne g(x)\}| .$$


Let $\pi_i$ be the function $x \mapsto x_i$ that maps each vector in $S$ to its $i$th component. Let $I$ be a subset
$\{i(1), i(2), \ldots, i(r)\}$ of $\{1,2,3,\ldots,M\}$ and write $\pi_I$ as an abbreviation for the function
$$x \mapsto \pi_{i(1)}(x)\,\pi_{i(2)}(x) \ldots \pi_{i(r)}(x) = \prod_{i \in I} x_i .$$
This is a function in $V$. In particular the empty set $\emptyset$ has $\pi_\emptyset$ equal to the constant function 1.


Lemma 11.1
The functions $\pi_I$ for $I \subset \{1,2,3,\ldots,M\}$ are a basis for $V$.


Proof:
We can write the value of $1_y(x)$ as the product
$$1_y(x) = \prod_{i=1}^M (x_i - y_i + 1) .$$


Now expand this product to get
$$1_y(x) = \sum_I \alpha_I \left( \prod_{i \in I} x_i \right)$$
for some scalar coefficients $\alpha_I \in \mathbb{F}_2$ that depend on $y$. This shows that $1_y$ is the linear combination
$1_y = \sum_I \alpha_I \pi_I$ of the functions $\pi_I$.


Since $(1_y)$ is a basis for $V$, we see that the functions $\pi_I$ span $V$. There are $2^M = \dim V$ of the $\pi_I$,
so they must form a basis for $V$.


The lemma shows that we can write any function $f \in V$ as a linear combination
$$f = \sum_I a_I \pi_I .$$


The Reed-Muller code $RM(M, r)$ has a code book equal to
$$\left\{ \sum (a_I \pi_I : |I| \le r) : a_I \in \mathbb{F}_2 \right\} .$$


This code book is a vector subspace with dimension
$$\binom{M}{0} + \binom{M}{1} + \ldots + \binom{M}{r}$$
so it gives a linear code of this rank and length $\dim V = 2^M$.


Example: 
The $RM(M, 0)$ code has only two code words 0 and 1 so it is the repetition code that repeats each bit
$2^M$ times. The Hamming distance $d(0, 1)$ is $2^M$, so this is also the minimum distance for $RM(M, 0)$.


$RM(M, 1)$ has dimension $1 + M$ and has the constant function 1 together with the functions $\pi_j$ for
$j = 1,2,3,\ldots,M$ as a basis. The Hamming distance satisfies
$$d(0, \pi_j) = 2^{M-1}$$
and it is easy to see that this is the minimum distance for $RM(M, 1)$.


Proposition 11.2 Reed-Muller codes
The Reed-Muller code $RM(M, r)$ has minimum distance $2^{M-r}$ for $0 \le r \le M$.


Proof:
We need to show that the minimum weight of a non-zero code word in $RM(M, r)$ is $2^{M-r}$. We
will do this by induction on $M$. We already know the result for $M = 1$.


Suppose that we have already established the result for $M - 1$ and all $r$.


Note first that, for the set $I = \{1,2,\ldots,r\}$ we have $d(0, \pi_I) = 2^{M-r}$ so the minimum distance is
at most this large. Let $f = \sum_I a_I \pi_I$ be a non-zero function in $RM(M, r)$. We split this into two parts
depending on whether $I$ contains $M$ or not. So
$$f = \left( \sum_{M \notin I} a_I \pi_I \right) + \left( \sum_{M \in I} a_I \pi_{I \setminus \{M\}} \right) \pi_M = f_0 + f_1 \pi_M .$$


The functions $f_0$ and $f_1$ do not depend on $x_M$, so we think of them as functions on $\mathbb{F}_2^{M-1}$. The sum $f_0$
is in $RM(M - 1, r)$ while $f_1 \in RM(M - 1, r - 1)$.


Denote by $d_M$ the Hamming distance for functions on $\mathbb{F}_2^M$ and by $d_{M-1}$ the Hamming distance for functions on $\mathbb{F}_2^{M-1}$.
Observe that
$$f(x_1, x_2, \ldots, x_{M-1}, 0) = f_0(x_1, x_2, \ldots, x_{M-1}) \quad\text{on the set where } x_M = 0;$$
$$f(x_1, x_2, \ldots, x_{M-1}, 1) = (f_0 + f_1)(x_1, x_2, \ldots, x_{M-1}) \quad\text{on the set where } x_M = 1.$$


Since $f$ is non-zero, either $f_0$ or $f_0 + f_1$ must be non-zero. If $f_0 = 0$, then $f_1$ is non-zero and
$$d_M(0, f) = d_{M-1}(0, f_1) \ge 2^{(M-1)-(r-1)} = 2^{M-r} .$$
If $f_0 \ne 0$ and $f_1 = -f_0$, then $f_0 = -f_1 \in RM(M - 1, r - 1)$ so we have
$$d_M(0, f) = d_{M-1}(0, f_0) \ge 2^{(M-1)-(r-1)} = 2^{M-r} .$$
Finally, if $f_0 \ne 0$ and $f_1 \ne -f_0$, then
$$d_M(0, f) = d_{M-1}(0, f_0) + d_{M-1}(0, f_0 + f_1) \ge 2 \times 2^{M-1-r} = 2^{M-r}$$
since both $f_0$ and $f_0 + f_1$ are non-zero elements of $RM(M - 1, r)$. In all cases we have
$$d_M(0, f) \ge 2^{M-r}$$
which completes the inductive step.


The Reed-Muller $RM(5, 1)$ code was used by NASA in the Mariner mission to Mars. This code
has rank $1 + 5 = 6$ and length $2^5 = 32$. So its rate of transmission is $\frac{6}{32} = \frac{3}{16}$. The last proposition shows
that it has minimum distance $2^4 = 16$. So it is 15-error detecting and 7-error correcting.
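
For small parameters the rank and minimum distance of a Reed-Muller code can be confirmed by exhaustive search. This is a brute-force sketch, feasible only when the rank is small; the function names are illustrative, and coordinates are numbered from 0 rather than from 1.

```python
from itertools import combinations, product


def rm_basis(m, r):
    """The monomial functions pi_I with |I| <= r, each tabulated on all 2^m
    points of the space of binary vectors of length m."""
    points = list(product((0, 1), repeat=m))
    basis = []
    for size in range(r + 1):
        for subset in combinations(range(m), size):
            # The empty subset gives the constant function 1.
            basis.append(tuple(int(all(x[i] for i in subset)) for x in points))
    return basis


def minimum_weight(basis):
    """Minimum weight over all non-zero binary linear combinations of the basis."""
    best = None
    for coeffs in product((0, 1), repeat=len(basis)):
        if not any(coeffs):
            continue
        word = [0] * len(basis[0])
        for c, b in zip(coeffs, basis):
            if c:
                word = [(w + b_i) % 2 for w, b_i in zip(word, b)]
        weight = sum(word)
        best = weight if best is None else min(best, weight)
    return best


if __name__ == "__main__":
    basis = rm_basis(5, 1)
    print(len(basis), len(basis[0]), minimum_weight(basis))   # rank, length, distance
```

For RM(5, 1) the search covers only 63 non-zero code words, so it runs instantly; the cost doubles with each unit of rank, so this check does not scale to large codes.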
12: CYCLIC CODES


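A linear code is cyclic if $c_N c_1 c_2 \ldots c_{N-2} c_{N-1}$ is a code word whenever $c_1 c_2 c_3 \ldots c_{N-1} c_N$ is a code
word. This means that the shift map:
$$T : F^N \to F^N ; \quad \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix} \mapsto \begin{pmatrix} x_N \\ x_1 \\ x_2 \\ \vdots \\ x_{N-1} \end{pmatrix}$$
maps the code book into itself.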
Example: 
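The matrix
$$C = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
defines a cyclic code over $\mathbb{F}_2$ since, if we label the columns of $C$ as $c_1, c_2, c_3$ and $c_4$, then
$T(c_1) = c_2$, $T(c_2) = c_3$, $T(c_3) = c_4$ and
$$T(c_4) = (1, 0, 0, 0, 1, 1, 0)^T = c_1 + c_2 + c_3 .$$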
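
The claim in this example can be verified by listing the whole code book and shifting every word. A small sketch follows; the columns of the matrix are written as tuples, and the names `shift` and `span` are illustrative.

```python
from itertools import product

# The four columns of the 7 x 4 matrix C in the example.
COLUMNS = [
    (1, 1, 0, 1, 0, 0, 0),
    (0, 1, 1, 0, 1, 0, 0),
    (0, 0, 1, 1, 0, 1, 0),
    (0, 0, 0, 1, 1, 0, 1),
]


def shift(word):
    """The shift map T: the last component moves to the front."""
    return (word[-1],) + tuple(word[:-1])


def span(vectors):
    """All binary linear combinations of the given vectors."""
    length = len(vectors[0])
    words = set()
    for coeffs in product((0, 1), repeat=len(vectors)):
        words.add(tuple(sum(c * v[i] for c, v in zip(coeffs, vectors)) % 2
                        for i in range(length)))
    return words


CODE_BOOK = span(COLUMNS)

if __name__ == "__main__":
    print(len(CODE_BOOK))                                  # number of code words
    print(all(shift(w) in CODE_BOOK for w in CODE_BOOK))   # closed under the shift
```

Because T is linear, it would be enough to check the four basis columns; the exhaustive check over all code words is affordable here and makes the closure property explicit.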


It is easier to describe cyclic codes if we identify $F^N$ with the quotient $F[X]/(X^N - 1)$. Recall that
$F[X]/(X^N - 1)$ is the set of equivalence classes of polynomials modulo $X^N - 1$. Given a polynomial
$P(X)$, divide it by $X^N - 1$ to get $P(X) = Q(X)(X^N - 1) + R(X)$ with $\deg R(X) < N$. Then $P(X)$ is
equivalent modulo $X^N - 1$ to
$$R(X) = r_0 + r_1 X + r_2 X^2 + \ldots + r_{N-1}X^{N-1} \in \{A(X) \in F[X] : \deg A(X) < N\} .$$


We will write $[P(X)]$ for the equivalence class of $P(X)$ modulo $X^N - 1$. So we have seen how to identify
$[P(X)]$ with $[R(X)]$ and hence with the vector
$$\begin{pmatrix} r_0 \\ r_1 \\ \vdots \\ r_{N-1} \end{pmatrix} \in F^N .$$


The shift map corresponds to
$$T : F[X]/(X^N - 1) \to F[X]/(X^N - 1) ; \quad [P(X)] \mapsto [X.P(X)]$$
because $X.X^{N-1} = X^N \equiv 1 \pmod{X^N - 1}$. The quotient homomorphism will be denoted by
$$q : F[X] \to F[X]/(X^N - 1) ; \quad P(X) \mapsto [P(X)] .$$


Proposition 12.1 Cyclic codes 
Let $W$ be a vector subspace of $F[X]/(X^N - 1)$ and let $J = q^{-1}(W)$. Then $W$ is the code book for a
cyclic code if and only if $J$ is an ideal of $F[X]$.

Proof:
The quotient map $q$ is linear, so $J$ is certainly a vector subspace of $F[X]$. Suppose that $W$ is
cyclic. Then $[P(X)] \in W \Rightarrow [X.P(X)] \in W$ for any polynomial $P(X)$. This means that
$$P(X) \in J \Rightarrow X.P(X) \in J .$$
Therefore, if $P(X) \in J$, then
$$A(X)P(X) = \left( \sum a_k X^k \right) P(X) = \sum a_k X^k.P(X) \in J$$
for any polynomial $A(X)$. Hence $J$ is an ideal of $F[X]$: $J \triangleleft F[X]$.


Conversely, suppose that $J \triangleleft F[X]$ and that $[P(X)] \in W$. Then $P(X) \in J$ and so $X.P(X) \in J$.
This means that $[X.P(X)] \in W$. So $W$ is cyclic.


Corollary 12.2 Generators of cyclic codes 
Let $W$ be the code book for a cyclic code in $F[X]/(X^N - 1)$. Then
$$W = \{[A(X)G(X)] : A(X) \in F[X]\}$$
for a generator polynomial $G(X) \in F[X]$. Moreover, $G(X)$ is a divisor of $X^N - 1$.

Proof:
The proposition shows that $J = q^{-1}(W)$ is an ideal of $F[X]$. Since $F[X]$ is a principal
ideal domain, this ideal must be of the form $G(X)F[X]$ for some polynomial $G(X)$. Therefore, $W =
\{[A(X)G(X)] : A(X) \in F[X]\}$.


Now $0 = q(X^N - 1)$ is in $W$, so $X^N - 1 \in J$. Hence $G(X)$ must divide $X^N - 1$.


We may always multiply the generator polynomial $G(X)$ by a scalar to ensure that it has leading
coefficient 1. Then the generator polynomial is uniquely determined by the cyclic code book $W$. Since
$G(X)$ divides $X^N - 1$, we may write $X^N - 1 = G(X)H(X)$. We call $H(X)$ the check polynomial for
the code. Once again, it is unique up to a scalar multiple.


Proposition 12.3 
Let $G(X)$ be the generator of a cyclic code of length $N$ over $F$. Then the vectors
$$[G(X)],\ [X.G(X)],\ [X^2.G(X)],\ \ldots,\ [X^{N-D-2}.G(X)],\ [X^{N-D-1}.G(X)]$$
are a basis for the code book in $F[X]/(X^N - 1)$. Hence the code book has rank $N - D$ where $D =
\deg G(X)$.


Proof:
We know that each of the products $X^k.G(X)$ is in the ideal $J$, so the vectors $[X^k.G(X)]$ must
lie in the code book $W$ for the cyclic code.


Every vector in $W$ is of the form $[G(X)P(X)]$ for some polynomial $P(X) \in F[X]$. By reducing
$G(X).P(X)$ modulo $X^N - 1$ we may ensure that $\deg P(X) < N - D$. So any vector in $W$ is of the form
$$p_0[G(X)] + p_1[X.G(X)] + p_2[X^2.G(X)] + \ldots + p_{N-D-1}[X^{N-D-1}.G(X)] .$$
Thus the vectors $[G(X)], [X.G(X)], [X^2.G(X)], \ldots, [X^{N-D-2}.G(X)], [X^{N-D-1}.G(X)]$ span $W$.
Suppose that these vectors were linearly dependent, say
$$p_0[G(X)] + p_1[X.G(X)] + p_2[X^2.G(X)] + \ldots + p_{N-D-1}[X^{N-D-1}.G(X)] = 0 .$$


Then
$$[(p_0 + p_1 X + p_2 X^2 + \ldots + p_{N-D-1}X^{N-D-1})G(X)] = [P(X)G(X)] = 0$$
so $(X^N - 1) \mid P(X)G(X)$. This means that $H(X) \mid P(X)$. Since $\deg P(X) < N - D = \deg H(X)$,
this implies that $P(X) = 0$. Hence the vectors $[G(X)], [X.G(X)], [X^2.G(X)], \ldots, [X^{N-D-2}.G(X)],
[X^{N-D-1}.G(X)]$ are linearly independent.

Using the basis for the code book in this proposition, we see that a matrix for the code is the
$N \times (N - D)$ matrix:
$$C = \begin{pmatrix} g_0 & 0 & 0 & \ldots & 0 \\ g_1 & g_0 & 0 & \ldots & 0 \\ g_2 & g_1 & g_0 & \ldots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \ldots & g_D \end{pmatrix}$$
where $G(X) = g_0 + g_1 X + g_2 X^2 + \ldots + g_D X^D$.


Let $S$ be the $D \times N$ matrix:
$$S = \begin{pmatrix} 0 & 0 & 0 & \ldots & h_2 & h_1 & h_0 \\ 0 & 0 & 0 & \ldots & h_1 & h_0 & 0 \\ 0 & 0 & 0 & \ldots & h_0 & 0 & 0 \\ \vdots & & & & & & \vdots \\ h_{N-D} & h_{N-D-1} & h_{N-D-2} & \ldots & 0 & 0 & 0 \end{pmatrix}$$
where $H(X) = h_0 + h_1 X + h_2 X^2 + \ldots + h_{N-D}X^{N-D}$. Then $h_{N-D} \ne 0$, so the rows of $S$ are linearly
independent. So its rank is $D$ and its nullity must be $N - D$. However, it is easy to check that each
column of the matrix $C$ lies in the kernel of $S$, because $G(X)H(X) = X^N - 1$ so
$$\sum_j h_{m-j}\,g_j = 0 \quad\text{for } 0 < m < N .$$


This means that the kernel of $S$ is spanned by the columns of $C$. So $\ker S = W = \mathrm{Im}(C)$ and $S$ is a
syndrome matrix for the cyclic code.


Example: 
Over the field $\mathbb{F}_2$ of integers modulo 2 we have
$$X^7 - 1 = (1 + X + X^3)(1 + X + X^2 + X^4) .$$
Let $G(X) = 1 + X + X^3$. This generates a cyclic binary code of length 7. A syndrome matrix is
$$\begin{pmatrix} 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 & 1 & 0 & 0 \end{pmatrix} .$$


All of the non-zero vectors in $\mathbb{F}_2^3$ occur as columns in this matrix, so the code is Hamming’s code with
the terms of the code re-ordered.
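
The check polynomial and the syndrome matrix in this example can be recomputed by polynomial long division over the integers modulo 2. The following is a sketch; polynomials are stored as coefficient lists with the lowest degree first, the divisor is assumed to have leading coefficient 1, and the helper names are illustrative.

```python
def poly_divmod_f2(a, b):
    """Quotient and remainder of a divided by b over the integers modulo 2.

    Both are coefficient lists, lowest degree first; b must end with a 1.
    """
    a = a[:]
    quotient = [0] * max(len(a) - len(b) + 1, 1)
    for shift in range(len(a) - len(b), -1, -1):
        if a[shift + len(b) - 1]:
            quotient[shift] = 1
            for i, b_i in enumerate(b):
                a[shift + i] ^= b_i      # subtraction modulo 2 is exclusive or
    return quotient, a[:len(b) - 1]


def generator_columns(g, n):
    """Columns of the code matrix: the coefficient vectors of X^k.G(X)."""
    d = len(g) - 1
    return [[0] * k + g + [0] * (n - d - 1 - k) for k in range(n - d)]


def syndrome_rows(h, n):
    """Rows of the syndrome matrix: the reversed coefficients of H, shifted."""
    d = n - (len(h) - 1)
    reversed_h = h[::-1]
    return [[0] * (d - 1 - j) + reversed_h + [0] * j for j in range(d)]


if __name__ == "__main__":
    x7_minus_1 = [1, 0, 0, 0, 0, 0, 0, 1]       # X^7 - 1 = X^7 + 1 modulo 2
    g = [1, 1, 0, 1]                            # G(X) = 1 + X + X^3
    h, remainder = poly_divmod_f2(x7_minus_1, g)
    print(h, remainder)
    for row in syndrome_rows(h, 7):
        print(row)
```

Each row of the syndrome matrix is orthogonal to each column of the code matrix because the corresponding sum is a coefficient of G(X)H(X) = X^7 - 1 in a degree strictly between 0 and 7.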
13: BCH CODES


13.1 The Characteristic of a Field 


Let $K$ be a finite field and consider the sequence
$$0_K,\ 1_K,\ 1_K + 1_K,\ 1_K + 1_K + 1_K,\ 1_K + 1_K + 1_K + 1_K,\ \ldots .$$


Eventually this sequence must repeat, so we obtain a first natural number $p$ with the sum of $p$ terms
$1_K + 1_K + \ldots + 1_K$ equal to 0. If $p$ factorised, then one of the factors would also have this property,
so $p$ must be a prime number. It is the characteristic of $K$. The set of $p$ elements $0_K, 1_K, 1_K + 1_K$,
$1_K + 1_K + 1_K, \ldots$ is then a subfield of $K$ which is isomorphic to the integers modulo $p$. Hence $K$
contains $\mathbb{F}_p$. ($K$ is then a vector space over $\mathbb{F}_p$ of some dimension $r$, so it has $p^r$ elements. Thus $K$ has
characteristic $p$ when it is one of the fields $\mathbb{F}_{p^r}$.)


We will only consider fields with characteristic 2 in this section, so they are all fields of order $2^r$
and will give us binary codes. The arguments readily extend to other characteristics.


We know that any cyclic code has a generator polynomial $G(X)$ with $G(X) \mid (X^N - 1)$ where $N$ is
the length of the code. We will assume that $N$ is odd. It can be shown (in the Galois Theory course)
that there is some finite field $K$ containing $F$ in which $X^N - 1$ factorises completely into linear factors
(or “splits”). The next lemma shows when these factors are all distinct.


Lemma 13.1 Separable fields
Let $N$ be an odd natural number and $K$ a field with characteristic 2 in which $X^N - 1$ factorises completely
into $N$ linear factors. Then $X^N - 1$ has $N$ distinct roots in $K$. These form a cyclic group.


Proof:
Suppose that $X^N - 1$ had a repeated root $\alpha$ in $K$. Then
$$X^N - 1 = (X - \alpha)^2 P(X)$$
for some polynomial $P(X) \in K[X]$. Now if we differentiate this formally we obtain
$$N X^{N-1} = 2(X - \alpha)P(X) + (X - \alpha)^2 P'(X) = (X - \alpha)\left( 2P(X) + (X - \alpha)P'(X) \right) .$$


So $X - \alpha$ divides both $X^N - 1$ and $N X^{N-1}$. Since $N$ is odd, $N 1_K \ne 0_K$. So $X - \alpha$ must divide the
highest common factor of $X^N - 1$ and $X^{N-1}$. This highest common factor is 1, so we get a contradiction.


We have now shown that the set $S$ of roots of $X^N - 1$ in $K$ contains $N$ distinct elements. It is
clearly a subgroup of the group $K^\times$. We know that $K^\times$ is cyclic, so the subgroup $S$ must be also.


The lemma shows that we can find a primitive root $\alpha$ of $X^N - 1$ in the field $K$ with the other roots
being
$$\alpha^2,\ \alpha^3,\ \ldots,\ \alpha^{N-1},\ \alpha^N = 1 .$$


The lemma is not true if $N$ is even. For example, $X^2 - 1 = (X - 1)^2$.




13.2 BCH codes 


Let $F$ be a finite field with characteristic 2. Choose an odd integer $N$ and let $K$ be a field containing
$F$ in which $X^N - 1$ factorises completely into linear factors. Take $\alpha$ as a primitive root of $X^N - 1$ in the
field $K$. The Bose-Ray-Chaudhuri-Hocquenghem (BCH) code with design distance $\delta$ is the cyclic
code of length $N$ over $F$ defined by the ideal
$$J = \{P(X) \in F[X] : P(\alpha) = P(\alpha^2) = P(\alpha^3) = \ldots = P(\alpha^{\delta-1}) = 0\} .$$
Here $\delta$ is a natural number with $1 \le \delta \le N$.
The generating polynomial $G(X)$ for this code is the minimal polynomial in $F[X]$ that is 0 at each
of the points $\alpha, \alpha^2, \ldots, \alpha^{\delta-1}$ in $K$. It is thus the least common multiple of the minimal polynomials for
these points. It is clearly a factor of $X^N - 1$. So the BCH code is a cyclic linear code of length $N$.


Example: 
Consider the polynomial X*’ — 1 over Fj. The roots of this form a cyclic group of order 7, so every root 
is a primitive root. We can factorise X’ — 1 over F2 as 


X?-1=(X -1)(X2 4 X?741)(X94+X41). 


The cubic factors are irreducible over F2 since, if they factorized further, then one of the factors would 
be linear and so give a root in Fo. 


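Let $\alpha$ be one of the roots of $X^3 + X + 1$ in $K$. Then $\alpha^7 = 1$ and so

$$(\alpha^2)^3 + (\alpha^2) + 1 = \alpha^6 + \alpha^2 + 1 = \alpha^6 (1 + \alpha^3 + \alpha) = 0 .$$

This shows that $\alpha^2$ is also a root of $X^3 + X + 1$. Repeating the argument shows that $\alpha^4 = (\alpha^2)^2$ is a
root. These roots are distinct, so

$$X^3 + X + 1 = (X - \alpha)(X - \alpha^2)(X - \alpha^4) .$$

Similarly we see that

$$X^3 + X^2 + 1 = (X - \alpha^3)(X - \alpha^5)(X - \alpha^6) .$$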


The BCH code with design distance 3 is given by the ideal 
$$J = \{P(X) \in \mathbb{F}_2[X] : P(\alpha) = P(\alpha^2) = 0\} .$$

The generating polynomial for this ideal is $X^3 + X + 1$. We saw at the end of lecture 12 that this gave
a rearrangement of Hamming’s code. 
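The factorisation in this example can be checked mechanically. The following is a minimal sketch, assuming polynomials over $\mathbb{F}_2$ are stored as Python integers whose bit $i$ is the coefficient of $X^i$; the helper name `gf2_mul` is an illustrative choice and not part of the notes.

```python
def gf2_mul(a: int, b: int) -> int:
    """Multiply two polynomials over F_2 (bit i of an int = coefficient of X^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # addition of polynomials over F_2 is XOR
        a <<= 1           # multiply a by X
        b >>= 1
    return result


x_plus_1 = 0b11       # X + 1, which equals X - 1 over F_2
cubic_a = 0b1101      # X^3 + X^2 + 1
cubic_b = 0b1011      # X^3 + X + 1

product = gf2_mul(gf2_mul(x_plus_1, cubic_a), cubic_b)
x7_minus_1 = (1 << 7) | 1   # X^7 + 1 = X^7 - 1 over F_2

print(product == x7_minus_1)
```

Neither cubic has a root in $\mathbb{F}_2$ (each has constant term 1 and an odd number of terms), which is the irreducibility argument used above.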


We want to prove that the minimum distance for a BCH code is at least as big as the design distance
$\delta$. To do this we need to recall a result about determinants.


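Lemma 13.2 Vandermonde determinants
The determinant

$$\Delta(X_1, X_2, \ldots, X_K) = \begin{vmatrix} X_1 & X_2 & X_3 & \cdots & X_K \\ X_1^2 & X_2^2 & X_3^2 & \cdots & X_K^2 \\ \vdots & \vdots & \vdots & & \vdots \\ X_1^K & X_2^K & X_3^K & \cdots & X_K^K \end{vmatrix}$$

satisfies

$$\Delta(X_1, X_2, \ldots, X_K) = \left( \prod_{k=1}^{K} X_k \right) \prod_{i<j} (X_j - X_i) .$$

Proof:
We will treat the $(X_k)$ as independent indeterminates and prove the lemma by induction on
$K$. The result is clearly true for $K = 1$.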


Consider $\Delta(X_1, X_2, \ldots, X_K)$ as a polynomial of degree $K$ in $X_K$ with coefficients that are polynomials
in $X_1, X_2, \ldots, X_{K-1}$. When $X_K = 0$ or $X_1$ or $X_2$ or ... or $X_{K-1}$, then the determinant is
obviously 0, so

$$X_K \prod_{i<K} (X_K - X_i)$$

must divide the determinant. Hence

$$\Delta(X_1, X_2, \ldots, X_K) = c(X_1, X_2, \ldots, X_{K-1}) \, X_K \prod_{i<K} (X_K - X_i)$$

for some polynomial $c(X_1, X_2, \ldots, X_{K-1})$. The leading coefficients (those of $X_K^K$) of both sides of this equation are

$$\Delta(X_1, X_2, \ldots, X_{K-1}) \quad \text{and} \quad c(X_1, X_2, \ldots, X_{K-1}) .$$

So we see that

$$\Delta(X_1, X_2, \ldots, X_K) = \Delta(X_1, X_2, \ldots, X_{K-1}) \, X_K \prod_{i<K} (X_K - X_i)$$

which completes the induction.


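Theorem 13.3 Minimum distance for BCH codes
The minimum distance for a BCH code with design distance $\delta$ is at least $\delta$.

Proof:
Let $P(X) = p_0 + p_1 X + \ldots + p_{N-1} X^{N-1}$ be a non-zero polynomial in the code book. Then
we have

$$\begin{pmatrix} 1 & \alpha & \alpha^2 & \alpha^3 & \cdots & \alpha^{N-1} \\ 1 & \alpha^2 & \alpha^4 & \alpha^6 & \cdots & \alpha^{2(N-1)} \\ 1 & \alpha^3 & \alpha^6 & \alpha^9 & \cdots & \alpha^{3(N-1)} \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1 & \alpha^{\delta-1} & \alpha^{2(\delta-1)} & \alpha^{3(\delta-1)} & \cdots & \alpha^{(N-1)(\delta-1)} \end{pmatrix} \begin{pmatrix} p_0 \\ p_1 \\ p_2 \\ \vdots \\ p_{N-2} \\ p_{N-1} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} .$$

Call the matrix above $A$. The Vandermonde determinant of [Lemma 13.2] shows that any $\delta - 1$ columns of $A$ are linearly independent
because, by [Lemma 13.1], no two of the $N$ powers of $\alpha$ are equal or 0. This means that there must be at least $\delta$ non-zero
coefficients $(p_j)$ in order for $Ap$ to be 0. Hence the minimum distance is at least $\delta$.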
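For the small example above the bound can be confirmed by brute force. This is a minimal sketch, assuming the binary code of length 7 generated by $X^3 + X + 1$ (design distance 3) and the same bitmask representation of polynomials over $\mathbb{F}_2$; the names are illustrative only.

```python
def gf2_mul(a: int, b: int) -> int:
    """Multiply two polynomials over F_2 (bit i of an int = coefficient of X^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


GENERATOR = 0b1011   # G(X) = X^3 + X + 1

# Every code word is M(X) G(X) with deg M < 4; the product has degree < 7,
# so no reduction modulo X^7 - 1 is needed.
codewords = [gf2_mul(message, GENERATOR) for message in range(16)]

# The minimum distance of a linear code is the least weight of a non-zero code word.
min_weight = min(bin(word).count("1") for word in codewords if word != 0)
print(len(set(codewords)), min_weight)
```

The smallest non-zero weight found is 3, which equals the design distance, so the bound of Theorem 13.3 is attained here.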


It remains for us to consider how to decode a message. Since our BCH code is a linear code, we 
can use the general methods for finding the closest code word to any message we receive. However, we 
can also exploit the algebraic definition of BCH codes to find more efficient ways to do this. 


We know from [Proposition 7.3] that our BCH code is $t$-error correcting, where $t = \lfloor \frac{1}{2}(\delta - 1) \rfloor$. So
suppose that a code word $c$ is sent and we receive $r = c + e$ where the error vector $e$ has at most $t$
non-zero entries. How do we find $e$ and thence recover the sent code word $c$?

Let $C(X)$, $R(X)$ and $E(X)$ be the polynomials with degree less than $N$ that correspond to the
vectors $c$, $r$ and $e$. We know that the code word $C(X)$ satisfies

$$C(\alpha) = C(\alpha^2) = \ldots = C(\alpha^{\delta-1}) = 0 .$$

So $R(\alpha^j) = E(\alpha^j)$ for $j = 1, 2, \ldots, \delta - 1$.


First calculate $R(\alpha^j)$ for $j = 1, 2, \ldots, \delta - 1$. If these are all 0 then $R(X)$ is a code word and there
have been no errors (or at least $\delta$ of them).

Otherwise let $\mathcal{E} = \{i : e_i \neq 0\}$ be the set of indices at which errors occur and assume that $0 < |\mathcal{E}| \leq t$.
The error locator polynomial is

$$\sigma(X) = \prod_{i \in \mathcal{E}} (1 - \alpha^i X) .$$

This is a polynomial (over $K$) of degree $|\mathcal{E}|$ with constant term 1. If we know $\sigma(X)$ then we can easily
find which powers $\alpha^{-i}$ are roots of $\sigma(X)$ and hence find the indices $i$ where errors have occurred. We
could then correct the errors by changing those entries. So we need to calculate $\sigma(X)$.


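Consider the formal power series

$$\eta(X) = \sum_{j=1}^{\infty} E(\alpha^j) X^j .$$

(Since $\alpha^N = 1$, the coefficients of this power series repeat.) Note that, for the first few coefficients,
$E(\alpha^j) = R(\alpha^j)$ for $j = 1, 2, \ldots, \delta - 1$. So we can calculate these coefficients in terms of the received
word $r$. These are the only coefficients we will use.

Since $E(\alpha^j) = \sum_{i \in \mathcal{E}} e_i \alpha^{ij}$, we have

$$\eta(X) = \sum_{j=1}^{\infty} \sum_{i \in \mathcal{E}} e_i \alpha^{ij} X^j = \sum_{i \in \mathcal{E}} e_i \sum_{j=1}^{\infty} \alpha^{ij} X^j = \sum_{i \in \mathcal{E}} \frac{e_i \alpha^i X}{1 - \alpha^i X} .$$

Write this as

$$\eta(X) = \frac{\omega(X)}{\sigma(X)}$$

for the polynomial

$$\omega(X) = \sum_{i \in \mathcal{E}} e_i \alpha^i X \prod_{j \in \mathcal{E},\, j \neq i} (1 - \alpha^j X) .$$

Note that both $\sigma(X)$ and $\omega(X)$ have degree $|\mathcal{E}| \leq t$.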

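We have shown that $\sigma(X)\eta(X) = \omega(X)$. Writing this out explicitly and using $E(\alpha^k) = R(\alpha^k)$ for
$k = 1, 2, \ldots, 2t$ we obtain

$$(\sigma_0 + \sigma_1 X + \ldots + \sigma_t X^t) \left( R(\alpha) X + R(\alpha^2) X^2 + \ldots + R(\alpha^{2t}) X^{2t} + E(\alpha^{2t+1}) X^{2t+1} + \ldots \right) = \omega_0 + \omega_1 X + \ldots + \omega_t X^t .$$

The coefficients of $X^n$ for $t < n \leq 2t$ are

$$\sum_{j=0}^{t} \sigma_j R(\alpha^{n-j}) = 0$$

which do not involve any of the terms $E(\alpha^k)$. So we get equations

$$\begin{pmatrix} R(\alpha^{t+1}) & R(\alpha^{t}) & \cdots & R(\alpha) \\ R(\alpha^{t+2}) & R(\alpha^{t+1}) & \cdots & R(\alpha^2) \\ \vdots & \vdots & & \vdots \\ R(\alpha^{2t}) & R(\alpha^{2t-1}) & \cdots & R(\alpha^{t}) \end{pmatrix} \begin{pmatrix} \sigma_0 \\ \sigma_1 \\ \vdots \\ \sigma_t \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} .$$

The matrix above is a $t \times (t + 1)$ matrix, so there is always a vector $\sigma \neq 0$ in its kernel. This tells us
what the error locator polynomial $\sigma(X)$ is and hence what the errors are that we should correct.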
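The smallest case of this procedure can be sketched in code. The block below assumes the binary BCH code of length 7 with $\delta = 3$ (so $t = 1$), takes $K = \mathbb{F}_8$ built from $X^3 + X + 1$, and stores field elements as 3-bit integers; the table names `EXP` and `LOG` and the function names are illustrative assumptions. With $t = 1$ the single equation is $\sigma_0 R(\alpha^2) + \sigma_1 R(\alpha) = 0$, and normalising $\sigma_0 = 1$ gives $\alpha^i = R(\alpha^2)/R(\alpha)$ for the error position $i$ (signs disappear in characteristic 2).

```python
MODULUS = 0b1011          # X^3 + X + 1, with alpha represented by 0b010

# EXP[i] = alpha^i as a 3-bit integer; LOG is the inverse table.
EXP = []
value = 1
for _ in range(7):
    EXP.append(value)
    value <<= 1
    if value & 0b1000:
        value ^= MODULUS
LOG = {element: power for power, element in enumerate(EXP)}


def evaluate(word, j):
    """Return R(alpha^j) for the binary word [r_0, r_1, ..., r_6]."""
    total = 0
    for i, bit in enumerate(word):
        if bit:
            total ^= EXP[(i * j) % 7]
    return total


def correct_single_error(received):
    """Correct at most one bit error using the error locator polynomial."""
    s1 = evaluate(received, 1)
    s2 = evaluate(received, 2)
    if s1 == 0 and s2 == 0:
        return list(received)                 # already a code word
    sigma1 = EXP[(LOG[s2] - LOG[s1]) % 7]     # sigma_1 = R(alpha^2) / R(alpha)
    position = LOG[sigma1]                    # sigma(X) = 1 - alpha^i X
    corrected = list(received)
    corrected[position] ^= 1
    return corrected


codeword = [1, 1, 0, 1, 0, 0, 0]              # 1 + X + X^3, the generator itself
received = list(codeword)
received[5] ^= 1                              # introduce one error at index 5
print(correct_single_error(received) == codeword)
```

For larger $t$ the same idea needs a kernel vector of the $t \times (t+1)$ matrix above rather than a single quotient.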


The special case of a BCH code where the field $F$ is $\mathbb{F}_q$ and $N = q - 1$ is also called a Reed-Solomon
code. These are very widely used. In particular, CDs are encoded using two interleaved Reed-Solomon
codes. Both are defined over $\mathbb{F}_{2^8} = \mathbb{F}_{256}$ with $\delta = 5$. They have lengths $N = 32$ and 28
respectively. The interleaving spreads the information physically on the disc so that bursts of successive
errors, as caused by a scratch, can be corrected more reliably. A CD can correct a burst of about 4,000
binary errors.
14: LINEAR FEEDBACK SHIFT REGISTERS


14.1 Definitions 


Suppose that we have $K$ registers each of which can contain one value from the finite field $F = \mathbb{F}_q$
with $q$ elements. Initially these contain the values $x_0, x_1, \ldots, x_{K-1}$, which is called the initial fill. At
each time step, the values in the registers are shifted down and the final register is filled by a new value
determined by the old values in all of the registers.


Denote the values in the registers at time $t = 0, 1, 2, \ldots$ by $X_0(t), X_1(t), \ldots, X_{K-1}(t)$. Then we
have

$$\begin{aligned} X_0(t+1) &= X_1(t) \\ X_1(t+1) &= X_2(t) \\ &\;\;\vdots \\ X_{K-2}(t+1) &= X_{K-1}(t) \\ X_{K-1}(t+1) &= f(X_0(t), X_1(t), \ldots, X_{K-1}(t)) . \end{aligned}$$

The function $f$ is called the feedback function. The system is called a linear feedback shift register (LFSR)
when the feedback function is linear, say $f(X_0, X_1, \ldots, X_{K-1}) = -c_0 X_0 - c_1 X_1 - \ldots - c_{K-1} X_{K-1}$, so
that

$$c_0 X_0(t) + c_1 X_1(t) + \ldots + c_{K-1} X_{K-1}(t) + X_{K-1}(t+1) = 0$$

for each $t$.

Such a linear feedback shift register gives an infinite stream of values

$$X_0(0), X_0(1), X_0(2), \ldots, X_0(t), \ldots .$$

The values in register $r$ are just these values with the first $r$ discarded and the others shifted back, since $X_r(t) = X_0(t + r)$.
This stream of values begins with the initial fill

$$X_0(0) = x_0, \; X_0(1) = x_1, \; X_0(2) = x_2, \; \ldots, \; X_0(K-1) = x_{K-1}$$

and satisfies the recurrence relation

$$c_0 X_0(t) + c_1 X_0(t+1) + c_2 X_0(t+2) + \ldots + c_{K-1} X_0(t+K-1) + X_0(t+K) = 0 .$$

The polynomial $C(X) = c_0 + c_1 X + c_2 X^2 + \ldots + c_{K-1} X^{K-1} + X^K$ is the feedback polynomial. It is the
characteristic polynomial for the recurrence relation and so determines the solutions.


It is simple to produce such feedback shift registers using computers and so to produce such streams. 
In superficial ways the stream of values appears random but we will see that this is not the case. 
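As an illustration of how simple such a register is to program, here is a minimal sketch. It assumes the field is $\mathbb{F}_q$ with $q$ prime, so that arithmetic is just arithmetic modulo $q$; the function name `lfsr_stream` and its parameters are hypothetical choices made for this example.

```python
from itertools import islice


def lfsr_stream(coeffs, fill, q):
    """Yield X_0(0), X_0(1), ... for the LFSR with feedback polynomial
    c_0 + c_1 X + ... + c_{K-1} X^{K-1} + X^K over the integers modulo a prime q.

    coeffs = [c_0, ..., c_{K-1}] and fill = [x_0, ..., x_{K-1}].
    """
    state = list(fill)
    while True:
        yield state[0]
        # New value f(X_0, ..., X_{K-1}) = -c_0 X_0 - ... - c_{K-1} X_{K-1}
        new_value = (-sum(c * x for c, x in zip(coeffs, state))) % q
        state = state[1:] + [new_value]   # shift down, fill the final register


# Feedback polynomial C(X) = 1 + X + X^3 over F_2 with initial fill 1, 0, 0.
first_values = list(islice(lfsr_stream([1, 1, 0], [1, 0, 0], 2), 10))
print(first_values)
```

The output begins 1, 0, 0, 1, 0, 1, 1 and then starts again, which already hints at the periodicity proved below.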


We will always assume that the first coefficient $c_0$ for the feedback function is not zero. For, if
$c_0 = 0$, then the value in the 0th register does not affect any later values and we should consider the
remaining $K - 1$ registers as a simpler LFSR.


Proposition 14.1 Periodicity for LFSR 
A linear feedback shift register is periodic. 
This means that there is an integer $N$ with $X_0(t + N) = X_0(t)$ for each time $t$. The smallest such
integer $N > 0$ is the period for the LFSR.


Proof:
For each time $t$, let $V(t)$ be the vector

$$V(t) = \begin{pmatrix} X_0(t) \\ X_0(t+1) \\ X_0(t+2) \\ \vdots \\ X_0(t+K-1) \end{pmatrix} \in F^K .$$

There are only a finite number $q^K$ of vectors in $F^K$ so the sequence $(V(t))_{t \in \mathbb{N}}$ must eventually repeat
a value taken earlier. Suppose that this first occurs at $V(N)$ with $V(N) = V(j)$ for some $0 \leq j < N$.

The definition of the LFSR shows that

$$V(t+1) = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -c_0 & -c_1 & -c_2 & \cdots & -c_{K-1} \end{pmatrix} V(t) .$$

Write this as $V(t+1) = MV(t)$. The matrix $M$ has determinant $\pm c_0$, which we are assuming to be
non-zero. So $M$ is invertible.

We have $V(t) = M^t V(0)$, so the first repeat gives

$$M^N V(0) = M^j V(0) .$$

If $j$ were not 0, then we could multiply by $M^{-1}$ to get an earlier repeat. Hence we must have $j = 0$ and
so $M^N V(0) = V(0)$. Consequently, the sequence of vectors $(V(t))$ is periodic with period $N$. Looking
at the first entry in these vectors shows that $(X_0(t))$ is also periodic.


Exercise: 

What happens when the coefficient $c_0$ is allowed to be 0? Do we still get periodicity when the feedback
function $f$ is not assumed to be linear? Is the matrix $M$ in the proof periodic? What is its characteristic
polynomial?


The matrix $M$ in the proof above clearly maps 0 to itself. Hence, the largest the period can be is
$N = q^K - 1$. In this case the sequence $V(0), V(1), \ldots, V(N-2), V(N-1)$ takes every value in $F^K \setminus \{0\}$
exactly once. In one period of length $q^K - 1$, we then see that 0 occurs $q^{K-1} - 1$ times while every
other number in $F$ occurs $q^{K-1}$ times. In this, rather weak, sense the sequence is random. However,
the sequence is periodic so it becomes entirely predictable once we have at least one period of data.

Let $N$ be the period for a linear feedback shift register and set

$$W(t) = \begin{pmatrix} X_0(t) \\ X_0(t+1) \\ X_0(t+2) \\ \vdots \\ X_0(t+N-1) \end{pmatrix} \in F^N .$$

Then the vector $W(t+1)$ is a cyclic shift of $W(t)$ because $X_0(t + N) = X_0(t)$. Also,

$$c_0 W(0) + c_1 W(1) + \ldots + c_{K-1} W(K-1) + W(K) = 0 .$$

So we see that the vectors $W(0), W(1), \ldots, W(K-1)$ span the code book for a cyclic code of length
$N$.
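The counts in the maximal-period case can be checked on a small register. The sketch below again assumes $q$ prime and uses the feedback polynomial $1 + X + X^3$ over $\mathbb{F}_2$ as an example; the function name is illustrative, and the loop relies on $c_0 \neq 0$ so that the state returns to the initial fill.

```python
def lfsr_period_and_counts(coeffs, fill, q):
    """Return (period, counts), where counts[v] is how often the value v
    occurs in one period of the output stream.  Assumes q prime and c_0 != 0."""
    start = tuple(fill)
    state = start
    counts = [0] * q
    period = 0
    while True:
        counts[state[0]] += 1
        new_value = (-sum(c * x for c, x in zip(coeffs, state))) % q
        state = state[1:] + (new_value,)
        period += 1
        if state == start:
            return period, counts


# K = 3, q = 2: the maximal period is q^K - 1 = 7.
period, counts = lfsr_period_and_counts([1, 1, 0], [1, 0, 0], 2)
print(period, counts)
```

Here 0 occurs $q^{K-1} - 1 = 3$ times and 1 occurs $q^{K-1} = 4$ times in one period, as the text predicts.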
14.2 The Berlekamp-Massey Algorithm


Suppose that you receive a sequence of values $(a_t)_{t \in \mathbb{N}}$ from the finite field $F$ that you believe has
come from a linear feedback shift register. Can you determine which linear feedback shift register has
produced it?


This is simply a matter of solving linear equations to find the coefficients in the feedback polynomial.
Suppose that the sequence came from a linear feedback shift register with $K$ registers and feedback
polynomial

$$C(X) = c_0 + c_1 X + c_2 X^2 + \ldots + c_{K-1} X^{K-1} + X^K .$$

Then we would have

$$c_0 a_t + c_1 a_{t+1} + \ldots + c_{K-1} a_{t+K-1} + a_{t+K} = 0$$

for every $t$. Consequently, taking $t = 0, 1, \ldots, K$,

$$\begin{pmatrix} a_0 & a_1 & a_2 & \cdots & a_{K-1} & a_K \\ a_1 & a_2 & a_3 & \cdots & a_K & a_{K+1} \\ a_2 & a_3 & a_4 & \cdots & a_{K+1} & a_{K+2} \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ a_{K-1} & a_K & a_{K+1} & \cdots & a_{2K-2} & a_{2K-1} \\ a_K & a_{K+1} & a_{K+2} & \cdots & a_{2K-1} & a_{2K} \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_{K-1} \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} . \qquad (*)$$

Hence the matrix above must have determinant 0.

Berlekamp and Massey formalised this as an algorithm to find the linear feedback shift register that
generated $(a_t)$. Begin with the smallest value of $K$ that might be possible for the number of registers
and compute the determinant

$$\begin{vmatrix} a_0 & a_1 & a_2 & \cdots & a_{K-1} & a_K \\ a_1 & a_2 & a_3 & \cdots & a_K & a_{K+1} \\ a_2 & a_3 & a_4 & \cdots & a_{K+1} & a_{K+2} \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ a_{K-1} & a_K & a_{K+1} & \cdots & a_{2K-2} & a_{2K-1} \\ a_K & a_{K+1} & a_{K+2} & \cdots & a_{2K-1} & a_{2K} \end{vmatrix} .$$

If this is non-zero, then we cannot solve the problem using $K$ registers. So increase $K$ by 1 and
repeat the process. If the determinant is 0, then solve (*) above to find the coefficients $c_j$. Then set
up a linear feedback shift register with initial fill $a_0, a_1, \ldots, a_{K-1}$ and feedback polynomial $C(X) =
c_0 + c_1 X + c_2 X^2 + \ldots + c_{K-1} X^{K-1} + X^K$. Check whether the stream produced by this agrees with the original
stream. If it does we are finished. If it does not, then consider other solutions $(c_j)$ or increase $K$.
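The search described above can be sketched over $\mathbb{F}_2$ as follows. This is a simplified variant, not the optimised Berlekamp-Massey algorithm: instead of testing a determinant it solves the recurrence equations for every available $t$ by Gaussian elimination, and it increases $K$ until a solution exists. The function names are hypothetical.

```python
def solve_gf2(rows, rhs):
    """Solve rows * c = rhs over F_2 by Gaussian elimination.
    Returns one solution (free variables set to 0) or None if inconsistent."""
    n = len(rows[0])
    aug = [row[:] + [b] for row, b in zip(rows, rhs)]
    pivot_columns = []
    r = 0
    for col in range(n):
        pivot = next((i for i in range(r, len(aug)) if aug[i][col]), None)
        if pivot is None:
            continue
        aug[r], aug[pivot] = aug[pivot], aug[r]
        for i in range(len(aug)):
            if i != r and aug[i][col]:
                aug[i] = [x ^ y for x, y in zip(aug[i], aug[r])]
        pivot_columns.append(col)
        r += 1
    if any(row[-1] for row in aug[r:]):
        return None
    solution = [0] * n
    for i, col in enumerate(pivot_columns):
        solution[col] = aug[i][-1]
    return solution


def find_lfsr(stream):
    """Return [c_0, ..., c_{K-1}] for the smallest K whose recurrence
    a_{t+K} = c_0 a_t + ... + c_{K-1} a_{t+K-1} (mod 2) fits the whole stream."""
    for k in range(1, len(stream) // 2 + 1):
        rows = [stream[t:t + k] for t in range(len(stream) - k)]
        rhs = [stream[t + k] for t in range(len(stream) - k)]
        coeffs = solve_gf2(rows, rhs)
        if coeffs is not None:
            return coeffs
    return None


stream = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
print(find_lfsr(stream))
```

For this stream the sketch recovers the coefficients of $C(X) = 1 + X + X^3$. Because every equation is used at once, the separate "check the stream" step in the text is built into the consistency test.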


14.3 Power Series 


Let $A(X) = a_0 + a_1 X + \ldots + a_D X^D$ and $B(X) = 1 + b_1 X + \ldots + b_K X^K$ be two polynomials in
$F[X]$. If we write $B(X) = 1 - \beta(X)$, then we have

$$\frac{1}{B(X)} = \frac{1}{1 - \beta(X)} = \sum_{k=0}^{\infty} \beta(X)^k .$$

Expanding the powers of $\beta(X)$ gives a formal power series for $1/B(X)$. Multiplying this by $A(X)$ we
obtain a formal power series for $A(X)/B(X)$, say

$$\frac{A(X)}{B(X)} = \sum_{j=0}^{\infty} u_j X^j .$$

In these formal power series we are not concerned about convergence but merely wish the coefficients
of each power of $X$ on each side of the equation to match. Hence we need

$$A(X) = B(X) \sum_{j=0}^{\infty} u_j X^j$$

which is equivalent to

$$a_n = \sum_{j=0}^{n} b_j u_{n-j} \quad \text{for } n = 0, 1, 2, \ldots$$

(where $b_0 = 1$, $a_n = 0$ for $n > D$ and $b_j = 0$ for $j > K$).

Thus the sequence $(u_j)$ satisfies

$$u_n = a_n - \sum_{j=1}^{n} b_j u_{n-j} \quad \text{for } n = 0, 1, \ldots, D ;$$

$$u_n = - \sum_{j=1}^{n} b_j u_{n-j} \quad \text{for } n > D .$$

This shows that the sequence $(u_j)$ is the stream produced by a linear feedback shift register with feedback
polynomial

$$C(X) = b_K + b_{K-1} X + \ldots + b_1 X^{K-1} + X^K .$$


This shows that determining whether a stream of numbers from F is the output of a LFSR is the 
same as determining whether a formal power series is the quotient of two polynomials. You should 
compare this to the method used to decode a BCH code in the previous lecture. 
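The recurrence for the coefficients $u_n$ translates directly into code. The following sketch assumes $F = \mathbb{F}_p$ with $p$ prime and $b_0 = 1$; the function name and the example polynomials $A(X) = 1 + X^2$, $B(X) = 1 + X^2 + X^3$ are choices made for illustration.

```python
def series_coefficients(a, b, p, count):
    """Return the first `count` coefficients u_n of A(X)/B(X) over the
    integers modulo a prime p, where a = [a_0, ..., a_D] and
    b = [1, b_1, ..., b_K]."""
    u = []
    for n in range(count):
        a_n = a[n] if n < len(a) else 0
        # u_n = a_n - sum_{j >= 1} b_j u_{n-j}, with b_j = 0 for j > K
        tail = sum(b[j] * u[n - j] for j in range(1, min(n, len(b) - 1) + 1))
        u.append((a_n - tail) % p)
    return u


# Over F_2: B(X) = 1 + X^2 + X^3 corresponds to the feedback polynomial
# C(X) = b_3 + b_2 X + b_1 X^2 + X^3 = 1 + X + X^3.
coefficients = series_coefficients([1, 0, 1], [1, 0, 1, 1], 2, 10)
print(coefficients)
```

The coefficients 1, 0, 0, 1, 0, 1, 1, ... are exactly the stream of the register with feedback polynomial $1 + X + X^3$ and initial fill 1, 0, 0.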


Exercise: 
You learnt at school that a decimal repeats if and only if it represents a rational number. How does 
this relate to LFSRs and the periodicity proved in Proposition 14.1? 
15: CRYPTOGRAPHY


15.1 Introduction 


In many circumstances we want to send a message so that it can only be read by the intended 
recipient. In this case we talk about the message being encrypted or enciphered. This is an important 
and very actively studied area. As computers grow more powerful we need to devise ciphers that are 
more difficult to break. 


We wish to transmit a message or plaintext which consists of a string of letters taken from a finite 
alphabet A. Usually there are a variety of different methods to encode or encipher the message, each 
depending on a key taken from a finite set $\mathcal{K}$ of possible keys. We will write $e_k : \mathcal{A} \to \mathcal{B}$ for the encoding
function or encrypting function corresponding to the key $k$. When the plaintext is encrypted in this way
we obtain the ciphertext. We must be able to decode or decipher the message, so there is a decrypting
function $d_k : \mathcal{B} \to \mathcal{A}$ with $d_k(e_k(a)) = a$ for every letter $a$.


Example: 
In a substitution cipher the key is a permutation $\pi$ of the alphabet $\mathcal{A}$. So each letter $a$ is encrypted as
$\pi(a)$. This is deciphered by applying the inverse permutation.


For messages in English, where the alphabet is the usual Latin alphabet, it is generally easy to 
decipher such a code. For we look at the frequency of letters in English and compare this with the 
frequencies in the ciphertext. Using this, and the fact that the message should make sense, it is easy to 
find the permutation used and so decipher messages. 


Example: 
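A more secure cipher is the Vigenère cipher, named after a French diplomat born in 1523. In this the key is a
string $k_1 k_2 \ldots k_D$ of letters from the alphabet $\mathcal{A}$. This string is repeated to give an infinite sequence

$$k_1 k_2 k_3 \ldots k_D k_{D+1} k_{D+2} \ldots k_{2D} k_{2D+1} k_{2D+2} \ldots$$

with $k_{qD+r} = k_r$. Now a plaintext $a_1 a_2 \ldots a_N$ is encrypted as $b_1 b_2 \ldots b_N$ where

$$b_j = a_j + k_j ,$$

the letters being identified with the integers modulo $|\mathcal{A}|$ so that they can be added.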

If the key is only one letter long, D = 1, then this is a particularly simple substitution cipher, 
called the Caesar cipher. When the key is longer, a given letter is encrypted differently depending on 
its place in the message. So simple frequency analysis will not work. Of course we could consider more 
complicated frequency analysis as was done by Babbage, who showed how to break the cipher provided 
we have access to a long enough ciphertext. 
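A short sketch of the Vigenère cipher may help fix the idea. It assumes the 26 capital letters as the alphabet (without the space), with letters identified with 0, 1, ..., 25 and added modulo 26; the function name and the sample key are illustrative.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def vigenere(text, key, decrypt=False):
    """Add (or, when decrypting, subtract) the repeated key letter by letter,
    working modulo the size of the alphabet."""
    sign = -1 if decrypt else 1
    output = []
    for j, letter in enumerate(text):
        shift = ALPHABET.index(key[j % len(key)])     # the repeated key
        index = (ALPHABET.index(letter) + sign * shift) % len(ALPHABET)
        output.append(ALPHABET[index])
    return "".join(output)


ciphertext = vigenere("ATTACKATDAWN", "LEMON")
print(ciphertext)
print(vigenere(ciphertext, "LEMON", decrypt=True))
```

Note that the two letters T in positions 2 and 3 of the plaintext are enciphered differently, while a key of length 1 reduces to the Caesar cipher.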


Babbage observed that common patterns of letters (words) will be enciphered in exactly the same 
way when the distance between their starting points is a multiple of D the length of the key. So we 
should look for reasonably long words that repeat in the ciphertext and then D must divide the distance 
between those repeats. (“The Code Book”, by Simon Singh has a lot of similar historical information.) 
What do we mean by breaking a cipher? We will assume that our enemies know the methods we are
using to encrypt messages but not the particular key we are using. So they know the possible functions
$e_k$ and $d_k$ but not the value of $k$. If they intercept a piece of ciphertext, they will want to decode it or,
even more damagingly, to work out the key so that they can decode any future ciphertexts.


There are various levels of attack we might protect against. 


1. Ciphertext only attack 
The enemies have a single piece of ciphertext. 


2. Known Plaintext attack 
The enemies have a piece of plaintext and the corresponding ciphertext. 


3. Chosen Plaintext attack 
The enemies can obtain arbitrary plaintexts of their choice and the corresponding ciphertexts. 


A substitution cipher is vulnerable to an attack at level 1, provided that the ciphertext is reasonably 
long and the message is in English. The Vigenére cipher is also insecure at level 1, although a much 
longer ciphertext is required and the enemies need to work harder to break the cipher. In both cases 
the enemies will be able not simply to find the original message but to find the key and so decipher 
other messages. For modern ciphers, we would want them to be secure against a level 3 attack. So we 
need to devise better ciphers. 


Note that every cipher will be vulnerable to a level 3 attack in some sense. For the system is finite: 
there are only finitely many letters and finitely many keys. So the enemies could carry out an exhaustive 
search. For each key $k \in \mathcal{K}$ they could encipher a fixed plaintext using that key and check if it agreed
with the known ciphertext. To have a secure cipher, we want the amount of work involved in doing 
such an exhaustive search, or the time it would take, to be prohibitively large. 


15.2 Equivocation 


Suppose that a random plaintext message $M$ is chosen from the set $\mathcal{M}$ of possible messages, with
a certain probability distribution. Then we choose a random key $K \in \mathcal{K}$ independently of $M$. The
ciphertext is then a random variable $C = e_K(M)$ in the set $\mathcal{C}$ of all possible ciphertexts.

We can think of this as transmitting the message $M$ through a noisy channel to produce $C$. The
noise comes from the choice of the key. The entropy H(M|C) is the message equivocation. It measures 
the amount of uncertainty there is about the plaintext once we know the ciphertext. Similarly, the key 
equivocation H(K|C) measures the amount of uncertainty about the key when we know the ciphertext. 
Unsurprisingly we have 


Proposition 15.1 Message equivocation $\leq$ key equivocation
The message equivocation is no larger than the key equivocation. 


Proof: 
We have 
H(K|C) = H(K,C) — H(C) = H(K, M,C) — H(C) 


since $M = d_K(C)$, so $M$ is a function of $(K, C)$. Therefore,


H(K|C) = H(K,M,C) — H(M,C) + H(M,C) — H(C) 
= H(K|M,C) + H(M|C) 


and so $H(K|C) \geq H(M|C)$.
We say that a cipher has perfect secrecy if the ciphertext gives us no information about the plaintext,
so $H(M|C) = H(M)$. This means that

$$H(M, C) = H(M) + H(C)$$

so $M$ and $C$ are independent. (Equivalently the mutual information $I(M, C) = H(C) - H(C|M) =
H(C) + H(M) - H(M, C) = 0$.)


Proposition 15.2 Perfect secrecy 
If a cipher has perfect secrecy, then there must be at least as many possible keys as there are possible 
plaintext messages. 


Proof:
When we say “possible keys” or “possible messages” we mean that the probability is strictly
positive.

Fix a message $m_0 \in \mathcal{M}$ and a key $k_0 \in \mathcal{K}$, both with strictly positive probability. Then $c_0 = e_{k_0}(m_0)$
also has strictly positive probability. For any possible message $m \in \mathcal{M}$ we have

$$P(C = c_0) = P(C = c_0 \mid M = m) .$$

So there must exist some key $k \in \mathcal{K}$ with $c_0 = e_k(m)$. If two messages $m_1, m_2$ give the same key $k$, then
$e_k(m_1) = c_0 = e_k(m_2)$ and so $m_1 = m_2$. Hence the map $m \mapsto k$ is injective.


This means that it is usually impractical to have perfect secrecy. However, there is a simple example
that does give perfect secrecy. G. S. Vernam (1890 - 1960) and J. O. Mauborgne (1881 - 1971) proposed
using a random key. We use a random number generator to produce a sequence

$$k_1 k_2 k_3 \ldots$$

of letters from the alphabet $\mathcal{A}$. Each letter $k_j$ is chosen uniformly from the $q = |\mathcal{A}|$ letters of the
alphabet $\mathcal{A}$ and the choices are independent for each position $j$. Then a message $m = a_1 a_2 \ldots a_N$ is
enciphered as $b_1 b_2 \ldots b_N$ where

$$b_j = a_j + k_j \pmod{q} .$$


Each random letter k; is only used once, so this cipher is called a one-time pad. 


A one-time pad has perfect secrecy. For we have 


$$P(M = m, C = c) = P(M = m, K = c - m) = P(M = m) P(K = c - m) = P(M = m) \frac{1}{q^N}$$

where the subtraction $c - m$ is done modulo $q$. Therefore, $M$ and $C$ are independent.
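The independence of $M$ and $C$ can be verified exactly in a toy case. The sketch below assumes a single-letter message ($N = 1$) over an alphabet of $q = 3$ letters, with an arbitrary non-uniform prior on the message chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

q = 3
# A hypothetical, deliberately non-uniform distribution for the message M.
prior = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}

# Joint distribution of (M, C) when the key K is uniform and independent of M.
joint = {}
for m, k in product(range(q), repeat=2):
    c = (m + k) % q
    joint[(m, c)] = joint.get((m, c), 0) + prior[m] * Fraction(1, q)

p_cipher = {c: sum(joint[(m, c)] for m in range(q)) for c in range(q)}

independent = all(
    joint[(m, c)] == prior[m] * p_cipher[c]
    for m in range(q)
    for c in range(q)
)
print(independent, p_cipher)
```

Whatever prior is chosen, the ciphertext comes out uniform and the joint distribution factorises, which is the content of the displayed calculation.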


It is believed that a one-time pad is used to encrypt the Washington — Kremlin hotline. However, 
there are major problems with a one-time pad. First we need a way to produce genuinely random 
sequences of letters. This is not straightforward. Secondly, if the cipher is to be deciphered by the 
intended recipient, he or she needs to know the key sequence. So a key, just as long as the message, 
needs to be sent to the receiver in a secure way. In this way we have replaced the problem of transmitting 
a message securely by another — transmitting the key — of the same difficulty. 


Various people have been tempted to replace the random key by a suitable pseudo-random sequence. 
For example we might use a linear feedback shift register to produce a sequence of supposedly random 
letters. This is vulnerable to a level 2 attack. For, if we know a piece of plaintext and the corresponding 
ciphertext, then we can compute the key sequence from their difference. Now the Berlekamp — Massey 
algorithm gives us a simple way to find the feedback polynomial and hence to find the key. 
15.3 Unicity


If we try to break a simple substitution cipher by examining letter frequencies we are helped by the 
fact that we know the original message makes sense and that, for a reasonable length message, there is 
only one key that gives a sensible message from the ciphertext. We might ask how long a message we 
need for this argument to apply. This length is the idea behind the unicity. 


Suppose that our message $m = a_1 a_2 a_3 \ldots a_N$ is of length $N$. The letters are random variables
$A_j$ giving a random message $M = A_1 A_2 A_3 \ldots A_N$. We will assume that the entropy of this random
sequence is $NH$ for a constant $H$, the entropy per letter for our message. (Compare Lecture 5.) When
this message is enciphered we get a ciphertext $C = C_1 C_2 C_3 \ldots C_N$. The unicity is the least value of $N$
for which the conditional entropy $H(K|C)$ is 0.


Since $K$ and $M$ determine $C$ and, conversely, $K$ and $C$ determine $M$, we have

$$\begin{aligned} H(K|C) &= H(K, C) - H(C) \\ &= H(M, K, C) - H(C) \\ &= H(M, K) - H(C) \\ &= H(M) + H(K) - H(C) . \end{aligned}$$

We are already assuming that $H(M) = NH$. For most useful ciphers we have the ciphertext uniformly
distributed, so $H(C) = N \log_2 |\mathcal{C}|$. Finally, we choose the key $K$ uniformly from $\mathcal{K}$, so $H(K) = \log_2 |\mathcal{K}|$.
Therefore, the unicity $N$ satisfies $0 = NH + \log_2 |\mathcal{K}| - N \log_2 |\mathcal{C}|$. So

$$N = \frac{\log_2 |\mathcal{K}|}{\log_2 |\mathcal{C}| - H} .$$

If we wish to be secure, we should not use a single key for more letters than the unicity. 


Example: 

For English a reasonable estimate of the entropy per letter is 1.2 bits. Suppose we use a substitution
cipher where the key is a permutation of the 26 letters and the space. Then $|\mathcal{K}| = 27!$ so $\log_2 |\mathcal{K}| = 93.14$,
$|\mathcal{C}| = 27$ and so the unicity is

$$\frac{\log_2 27!}{\log_2 27 - 1.2} = \frac{93.14}{4.75 - 1.2} \approx 26.2 .$$

So we expect to be able to find the key uniquely if the ciphertext is longer than 26 letters.
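The arithmetic in this example is easy to reproduce. The sketch below simply evaluates the unicity formula with the values quoted in the text (27 symbols, 1.2 bits per letter); the function name is an illustrative choice.

```python
import math


def unicity(log2_keys, alphabet_size, entropy_per_letter):
    """Evaluate N = log2|K| / (log2|C| - H)."""
    return log2_keys / (math.log2(alphabet_size) - entropy_per_letter)


log2_keys = math.log2(math.factorial(27))   # substitution cipher on 27 symbols
n = unicity(log2_keys, 27, 1.2)
print(round(log2_keys, 2), round(n, 1))
```

A lower entropy per letter (a more redundant language) makes the denominator larger and so reduces the unicity.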
16: SYMMETRIC AND ASYMMETRIC CIPHERS


The ciphers that we have considered so far are symmetric in the sense that both the sender and the 
receiver need to know the key in order to encrypt or decrypt the message. Such ciphers are useful but 
require that the key is kept securely and can be securely communicated from the sender to the receiver. 
If there are many people I wish to communicate with, then I will need a different key for each and need 
to keep all these keys secure. 


16.1 *Feistel Ciphers* 


Suppose that we have some simple (non-linear) functions $f_k : \mathbb{F}_2^N \to \mathbb{F}_2^N$ for keys $k \in \mathcal{K}$. Then we
can construct a more complicated and more secure cipher by using these functions in sequence.

Let $k = (k_1, k_2, \ldots, k_R)$ be a vector of $R$ keys and consider a plaintext $(a_0, a_1) \in \mathbb{F}_2^N \times \mathbb{F}_2^N$. From
these we construct a sequence of letters $a_0, a_1, a_2, \ldots, a_{R-1}, a_R, a_{R+1}$ so that:

$$(a_0, a_1) \mapsto (a_1, a_0 + f_{k_1}(a_1)) = (a_1, a_2) \mapsto (a_2, a_1 + f_{k_2}(a_2)) = (a_2, a_3) \mapsto \ldots$$
$$\ldots \mapsto (a_R, a_{R-1} + f_{k_R}(a_R)) = (a_R, a_{R+1}) .$$

Then the ciphertext is $(a_R, a_{R+1})$. This is called a Feistel cipher.
Deciphering is done by using the keys in reverse order:

$$(a_R, a_{R+1}) \mapsto (a_{R+1} - f_{k_R}(a_R), a_R) = (a_{R-1}, a_R) \mapsto \ldots$$
$$\ldots \mapsto (a_3 - f_{k_2}(a_2), a_2) = (a_1, a_2) \mapsto (a_2 - f_{k_1}(a_1), a_1) = (a_0, a_1)$$


and is equally easy to compute. 
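The construction can be sketched in a few lines. In the block below elements of $\mathbb{F}_2^N$ are stored as $N$-bit integers, so addition and subtraction are both XOR; the round function is a made-up toy chosen only to be non-linear and is not taken from any real cipher, and the keys are arbitrary.

```python
N_BITS = 16
MASK = (1 << N_BITS) - 1


def round_function(key, half):
    """A hypothetical non-linear toy function f_k on N-bit words."""
    return ((half * half + key) ^ (half >> 3)) & MASK


def feistel_encrypt(a0, a1, keys):
    left, right = a0, a1
    for key in keys:
        # (a_{j-1}, a_j) -> (a_j, a_{j-1} + f_k(a_j))
        left, right = right, left ^ round_function(key, right)
    return left, right


def feistel_decrypt(a_r, a_r1, keys):
    left, right = a_r, a_r1
    for key in reversed(keys):
        # (a_j, a_{j+1}) -> (a_{j+1} - f_k(a_j), a_j)
        left, right = right ^ round_function(key, left), left
    return left, right


keys = [0x1234, 0xBEEF, 0x0F0F, 0xA5A5]
plaintext = (0x0123, 0x4567)
ciphertext = feistel_encrypt(*plaintext, keys)
print(feistel_decrypt(*ciphertext, keys) == plaintext)
```

Decryption never inverts the round function itself, which is why $f_k$ is free to be any function at all.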


Many practical ciphers use this idea. For example, the DES (Data Encryption Standard, of the US
National Bureau of Standards, 1977 - 1988) uses Feistel ciphers. In this case $N = 32$ so we encipher
binary strings of 64 bits. The transformation is repeated R = 16 times. Despite this complexity, it is 
possible to break such ciphers. Using a reasonably large parallel computing network it can be broken 
in a few days. Now more secure symmetric ciphers, such as the AES (Advanced Encryption Standard) 
are used. 


16.2 Asymmetric Ciphers 


We can also consider asymmetric ciphers where there are two keys, one for encrypting and the 
other for decrypting. For this to be sensible, it must be difficult to find the deciphering key from the 
enciphering one. 


A particularly useful version of asymmetric ciphers is public key cryptography. Here, the receiver 
publishes for everyone to see the public key that should be used to encrypt messages intended for her. 
This enables anyone to compute the encrypting function and so to encrypt a message. Thus I can 
encrypt my plaintext message and send it to the receiver. She has a private key that enables her to 
decipher this encrypted message and so recover the plaintext. The private key is kept secure by the 
receiver, so that no one else can decipher the message. 


For this idea to work, it must be hard to decipher a ciphertext without knowing the private key, even 
though we already know the public key. So, it should be simple to encrypt a message $m$ as $c = e_k(m)$
using the function $e_k$ that we know from the public key. However, if we are given the ciphertext $c$ it
should be very hard to find which plaintext $m$ has $c = e_k(m)$. The function $e_k$ is a “one-way function”
where it is easy to compute $e_k(m)$ but difficult to find $e_k^{-1}(c)$.
Public key cryptography has the great advantage that we do not need to send the public key
securely since it can be published for everyone to see. Also, the private key is kept securely by the
receiver and does not need to be sent to anyone else. Even if we wish to use a symmetric cipher to
exchange information, it may be useful to use a public key cipher to send the symmetric key
securely.


The most commonly used public key cryptography relies on mathematical problems that are difficult 
to solve quickly. So the RSA cipher and the Rabin cipher rely on it being hard to factorise large integers. 
The ElGamal cipher relies on it being hard to find which power of a primitive root modulo N gives 
a particular integer. These problems have been found to be very difficult to solve, so we hope that 
breaking the ciphers is equally difficult. Of course, if there is a breakthrough and someone discovers 
a very fast method to solve these problems, then the ciphers will immediately become insecure. Even 
without this, as computers become faster the size of integers that can be factorised in a reasonable time 
grows larger, so we need to move to more complicated ciphers of the same type. 


Before we look at examples of public key cryptography we need to think about what makes a 
computation hard or lengthy. 
17: COMPLEXITY


All of the problems that we are considering are finite and so could be solved simply by considering 
all of the possibilities. However, some are much easier to solve than others. Consider for example: 


“Factorise 20,989” compared with “Show that $139 \times 151 = 20,989$.”
or
“Find $n$ with $2^n \equiv 5 \pmod{211}$” compared with “Verify that $2^{132} \equiv 5 \pmod{211}$.”


In each case it takes a long time to do the first and is quick to check the second. We will consider 
this relationship by asking how long it takes to perform certain calculations. 


17.1 Polynomial Time 


How long will it take to compute the value $f(n)$ of some function $f$? This is likely to depend on the
value of $n$ and, in particular, on how big it is. We will suppose that $n$ is written in binary and has $B$
digits (so $n < 2^B$). Each elementary operation, such as adding two binary bits, or comparing two bits,
takes a maximum time $\tau$. So the time to compute $f(n)$ will be at most the time $\tau$ multiplied by the
number of elementary operations required to do the computation. We will say that $f$ can be computed
in polynomial time if there is an algorithm to compute $f(n)$ that takes at most time $cB^k$ for all natural
numbers $n$ with $B$ bits. Here $c$ and $k$ are some constants independent of $n$. Similarly, a function of
several variables $f(n_1, n_2, \ldots, n_r)$ can be computed in polynomial time if there is an algorithm that
computes $f(n_1, n_2, \ldots, n_r)$ in at most time $cB^k$ where $B = B_1 + B_2 + \ldots + B_r$ is the sum of the bit
lengths of the numbers $n_1, n_2, \ldots, n_r$.


It is easy to give examples of functions that can be computed in polynomial time. 
(a) Adding or subtracting binary integers. 
(b) Multiplying binary integers. 
(c) Division of binary integers to give a quotient and remainder. 
(d) Computing highest common factors, using Euclid’s algorithm. 
(e) Modular arithmetic (addition, subtraction, multiplication, division) modulo N. 


(f) Exponentiation modulo $N$. To compute $a^n$, expand $n$ in binary as $\sum_j n_j 2^j$ with each $n_j = 0$ or 1.
Then compute the powers $a, a^2, a^4, \ldots, a^{2^j}, \ldots, a^{2^B}$ by repeated squaring and

$$a^n = \prod_{n_j = 1} a^{2^j}$$

as a product.


(g) Testing primality. Agrawal, Kayal and Saxena (2002) gave a polynomial time algorithm to test 
primality. 
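Item (f) above, exponentiation by repeated squaring, can be sketched as follows. The function name `power_mod` is an illustrative choice; Python's built-in three-argument `pow` does the same job and is used here only as a cross-check.

```python
def power_mod(a, n, modulus):
    """Compute a^n modulo `modulus` using the binary expansion of n."""
    result = 1 % modulus
    square = a % modulus            # holds a^(2^j) at step j
    while n > 0:
        if n & 1:                   # binary digit n_j = 1
            result = (result * square) % modulus
        square = (square * square) % modulus
        n >>= 1
    return result


print(power_mod(2, 132, 211), pow(2, 132, 211))
```

The loop runs once per binary digit of $n$, so the number of modular multiplications is at most $2B$ for a $B$-bit exponent.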


However, there are functions for which polynomial time algorithms are not known. For example: 


1. Factoring 
An integer N is known to be the product of two primes p and q. Find these primes. 
2. Discrete Logarithm
Let $a$ be a primitive root modulo the prime $p$, so $a$ is a generator of the cyclic group $\mathbb{F}_p^\times$. Given
$x \in \mathbb{F}_p^\times$, find $n$ with $a^n \equiv x \pmod{p}$. We call $n$ the logarithm of $x$ to base $a$ and denote it by
$\log_a x$.


The elementary methods for computing 1 and 2 take much longer than polynomial time. To factor
$N$ we could try dividing by successive primes up to $N^{1/2}$. This takes time $O(N^{1/2}) = O(2^{B/2})$. To
compute the discrete logarithm $\log_a x$, set $m = \lceil p^{1/2} \rceil$ and write $n = qm + r$ with $0 \leq q, r < m$. Then

$$a^n \equiv x \pmod{p} \iff (a^m)^q \equiv x a^{-r} \pmod{p} .$$

So compute $(a^m)^q$ and $x a^{-r}$ for all $0 \leq q, r < m$ and see where they agree. This takes time
$O(p^{1/2} \log p) = O(2^{B/2} B)$.
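The method just described for the discrete logarithm can be sketched as below. It assumes $p$ is prime and Python 3.8 or later (for `math.isqrt` and modular inverses via `pow(a, -1, p)`); the function name is hypothetical, and the values $x a^{-r}$ are kept in a dictionary so that agreements can be found quickly.

```python
import math


def discrete_log(a, x, p):
    """Find n with a^n = x (mod p) by writing n = q*m + r, 0 <= q, r < m."""
    m = math.isqrt(p - 1) + 1          # equals the ceiling of sqrt(p)
    a_inverse = pow(a, -1, p)

    # Table of x * a^(-r) for 0 <= r < m.
    table = {}
    value = x % p
    for r in range(m):
        table.setdefault(value, r)
        value = value * a_inverse % p

    # Look for (a^m)^q in the table.
    giant_step = pow(a, m, p)
    value = 1
    for q in range(m):
        if value in table:
            return q * m + table[value]
        value = value * giant_step % p
    return None


n = discrete_log(2, 5, 211)
print(n, pow(2, n, 211))
```

For the earlier example this finds $n = 132$, using about $2\lceil\sqrt{211}\rceil = 30$ multiplications instead of up to 210.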


Significantly better methods are known but these are still well short of polynomial time. Using 
number field sieves we can achieve a time 


$$O\left( \exp\left( c B^{1/3} (\log B)^{2/3} \right) \right) .$$

In practice, it seems to take a long time to solve either 1 or 2 for random large values of $p$, $q$ and $x$. We
will describe various ciphers based on the difficulty of solving these problems.


The RSA cipher relies on factoring being difficult. Up until 2007 the RSA laboratories offered 
rewards for factoring large challenge numbers. The RSA-B challenge involves factoring a number with 
B binary bits. Although the prizes have now ceased, this is still used as a test of computing power. The 
recent successes were 


Challenge Number    Decimal digits    Factored          Prize
RSA-576             174               December, 2003    $10,000
RSA-640             193               November, 2005    $20,000
RSA-704             212               July, 2012        ($30,000)
RSA-768             232               December, 2009    ($50,000)
RSA-896             270                                 ($75,000)
RSA-1024            309                                 ($100,000)


(See http://www.emc.com/emc-plus/rsa-labs/historical/ the-rsa-factoring-challenge-faq.htm or http://en.wikipedia.or 


17.2 Modular Arithmetic 


For each natural number $N$, the integers modulo $N$ will be denoted by $\mathbb{Z}_N$. The set of integers
modulo $N$ that are coprime to $N$:


$$\mathbb{Z}_N^\times = \{a = 0, 1, 2, \ldots, N-1 : (a, N) = 1\}$$


is a multiplicative group of order $\varphi(N)$, the Euler totient function. Hence Lagrange’s theorem shows
that
$$a^{\varphi(N)} \equiv 1 \pmod N \quad \text{for each } a \text{ with } (a, N) = 1 .$$


This is the Euler-Fermat theorem.


When $N$ is a prime $p$ we know that $\varphi(p) = p-1$, and $\mathbb{Z}_p^\times = \mathbb{F}_p^\times$ consists of all the non-zero elements
in the field $\mathbb{F}_p$. The set $\mathbb{Z}_p^\times = \mathbb{F}_p^\times$ is a cyclic group of order $\varphi(p) = p-1$. An integer $a$ is a primitive
root modulo $p$ if $a$ is a generator of this cyclic group. The Euler-Fermat theorem gives Fermat’s little
theorem:

$$a^p \equiv a \pmod p \quad \text{for all integers } a .$$


When $N$ is a product of two distinct primes, say $N = pq$, then $\varphi(N) = (p-1)(q-1)$. Fermat’s
little theorem shows that, for each integer $k \ge 0$,


$$a^{k(p-1)+1} \equiv a \pmod p \quad \text{for all integers } a .$$


So, in particular, $a^{k\varphi(N)+1} \equiv a \pmod p$. Similarly, $a^{k\varphi(N)+1} \equiv a \pmod q$. So we obtain


$$a^{k\varphi(N)+1} \equiv a \pmod N \quad \text{for all integers } a .$$


We will also need: 


Theorem 17.1 Chinese Remainder Theorem
Let $p, q$ be two coprime integers, so there are integers $a, b$ with $ap + bq = 1$.
The equations $x \equiv c \pmod p$ and $x \equiv d \pmod q$ are equivalent to


$$x \equiv c + ap(d-c) \pmod{pq} .$$


Proof:
If $x \equiv c + ap(d-c) \pmod{pq}$ then, since $c + ap(d-c) = c + (1-bq)(d-c) = d - bq(d-c)$, it is clear
that $x \equiv c \pmod p$ and $x \equiv d \pmod q$.


Conversely, suppose that $x \equiv c \pmod p$ and $x \equiv d \pmod q$, so $p \mid (x-c)$ and $q \mid (x-d)$. Then


$$p \mid (x - c - ap(d-c)) \quad \text{and} \quad q \mid (x - d + bq(d-c)) .$$


Since $p$ and $q$ are coprime, this shows that $pq$ divides $x - d + bq(d-c) = x - c - ap(d-c)$, so


$$x \equiv c + ap(d-c) \pmod{pq} .$$
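As an illustration of how the formula in Theorem 17.1 is used in practice, here is a minimal Python sketch. The helper names `extended_gcd` and `crt` are hypothetical; the first computes the integers a, b with ap + bq = 1 by Euclid's algorithm and the second applies x = c + ap(d - c) (mod pq).

```python
def extended_gcd(p, q):
    """Return (g, a, b) with a*p + b*q == g == gcd(p, q)."""
    old_r, r = p, q
    old_a, a = 1, 0
    old_b, b = 0, 1
    while r != 0:
        t = old_r // r
        old_r, r = r, old_r - t * r
        old_a, a = a, old_a - t * a
        old_b, b = b, old_b - t * b
    return old_r, old_a, old_b

def crt(c, p, d, q):
    """Return the x modulo p*q with x == c (mod p) and x == d (mod q)."""
    g, a, _b = extended_gcd(p, q)
    if g != 1:
        raise ValueError("p and q must be coprime")
    return (c + a * p * (d - c)) % (p * q)
```

For example, `crt(2, 3, 3, 5)` gives the residue modulo 15 that is 2 modulo 3 and 3 modulo 5.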


The Chinese Remainder theorem shows that the multiplicative group homomorphism
$$\mathbb{Z}_{pq}^\times \to \mathbb{Z}_p^\times \times \mathbb{Z}_q^\times ; \quad a \mapsto (a \bmod p, \; a \bmod q)$$
is an isomorphism. In particular, $\varphi(pq) = \varphi(p)\varphi(q)$ for coprime integers $p$ and $q$.


Proposition 17.2 Quadratic residues 


Let $p$ be an odd prime. In the field $\mathbb{F}_p$, the only square root of 0 is 0; precisely $\frac12(p-1)$ elements have
no square root and the remaining $\frac12(p-1)$ non-zero elements have exactly two square roots.


Proof:
Since $\mathbb{F}_p$ is a field, we have


$$x^2 \equiv y^2 \pmod p \iff (x-y)(x+y) \equiv 0 \pmod p .$$


Hence, the squaring map $s : \mathbb{F}_p \to \mathbb{F}_p ; \; x \mapsto x^2 \pmod p$ has $s(x) = s(y)$ if and only if $y = \pm x$. This
means that $s^{-1}(0) = \{0\}$ and every other square has two square roots. Hence there must be $\frac12(p-1)$
non-zero squares.


Proposition 17.3 Square roots modulo pq
Let N be the product of two distinct, odd primes p and q. Suppose that the integer a is a square modulo 
N. Then either: 


(a) (a, N) = 1 and there are exactly 4 square roots of a modulo N. 
(b) (a, N) = p or q and there are exactly 2 square roots of a modulo N. 


(c) a=0 (mod N) and 0 is the only square root of 0 modulo N. 


Proof:
For an integer $x$ we have


$$x^2 \equiv a \pmod N \iff x^2 \equiv a \pmod p \text{ and } x^2 \equiv a \pmod q .$$


When $(a, N) = 1$, we have 2 solutions to $x^2 \equiv a \pmod p$. Similarly, there are 2 solutions to $x^2 \equiv a
\pmod q$. Now the Chinese remainder theorem shows that there are 4 solutions to $x^2 \equiv a \pmod N$.


When $p \mid a$ there is only one solution to $x^2 \equiv a \pmod p$. This gives parts (b) and (c).


18: PUBLIC KEY CIPHERS


18.1 RSA Ciphers 


The Rivest-Shamir-Adleman (RSA) ciphers are our first example of public key ciphers. They
are very widely used and rely on the difficulty of factoring products $N = pq$ of two large primes.


To make an RSA cipher, I first choose two large primes $p, q$ and set $N = pq$. (Typically these
are primes with at least 800 binary bits.) Then I choose an exponent $e$ randomly with $e$ coprime to
$\varphi(N) = (p-1)(q-1)$. I can always ensure that $(e, \varphi(N)) = 1$ by taking $e$ as a prime number larger
than both $p$ and $q$. This number $e$ is called the encrypting exponent. Euclid’s algorithm allows me to
find integers $d, k$ with

$$de - k\varphi(N) = 1$$


(in polynomial time). The integer $d$ is called the decrypting exponent.


The public key will be $(N, e)$. The encrypting function is
$$a \mapsto a^e \pmod N .$$
The private key will be $d$. The decrypting function is
$$c \mapsto c^d \pmod N .$$
The Euler-Fermat theorem shows that
$$(a^e)^d = a^{1 + k\varphi(N)} \equiv a \pmod N .$$
So the decrypting function is inverse to the encrypting function.
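The key set-up, encryption and decryption just described can be sketched as follows. The primes 61 and 53 and the exponent 17 in the usage note are toy values chosen purely for illustration (far too small for real use), the function names are hypothetical, and `pow(e, -1, phi)` assumes Python 3.8 or later.

```python
import math

def make_rsa_key(p, q, e):
    """Return ((N, e), d) for distinct primes p, q and an exponent e coprime to phi(N)."""
    N, phi = p * q, (p - 1) * (q - 1)
    if math.gcd(e, phi) != 1:
        raise ValueError("e must be coprime to phi(N)")
    d = pow(e, -1, phi)          # d*e = 1 (mod phi(N)), found by Euclid's algorithm
    return (N, e), d

def encrypt(a, public):
    N, e = public
    return pow(a, e, N)          # a -> a^e (mod N)

def decrypt(c, public, d):
    N, _e = public
    return pow(c, d, N)          # c -> c^d (mod N)
```

With `public, d = make_rsa_key(61, 53, 17)`, every letter `a` in the alphabet of integers modulo N satisfies `decrypt(encrypt(a, public), public, d) == a`.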


An enemy who intercepts a ciphertext needs to find the plaintext knowing only the public key. This
appears to be tantamount to finding the factors $p$ and $q$ of $N$ so that $\varphi(N)$ can be computed.


Theorem 18.1 Security of the RSA ciphers 

Suppose that we have an algorithm to determine the private key for an RSA cipher when we are given the 
public key. Then the algorithm permits us to factorise products of two distinct primes, with probability 
arbitrarily close to 1. 


If we can do this in polynomial time, then we have a very effective way to break RSA ciphers but the 
theorem shows that this would also give us an equally effective way to factorise. No such algorithm is 
known. 


Proof:
We are assuming that we have an algorithm that gives the decrypting exponent $d$ in terms of
$N$ and $e$. Then $a^{de} \equiv a \pmod N$ for every $a \in \mathbb{Z}_N$.


Write $de - 1 = 2^s r$ for some odd integer $r$. Let $\operatorname{ord}_p(x)$ denote the order of an element $x$ in the
group $\mathbb{Z}_p^\times$. Set $X = \{x \in \mathbb{Z}_N^\times : \operatorname{ord}_p(x^r) \ne \operatorname{ord}_q(x^r)\}$. We will prove various properties of $X$ that
eventually prove our theorem.


Lemma A
If $x \in X$, then we can factorise $N$.


Proof:
If $x \in X$, then $y = x^r$ satisfies $y^{2^s} = x^{2^s r} = x^{de-1} = 1$ in $\mathbb{Z}_N^\times$, so $\operatorname{ord}_p(y)$ and $\operatorname{ord}_q(y)$ must be
powers of 2. Suppose that $\operatorname{ord}_p(y) = 2^t < \operatorname{ord}_q(y)$. Then $y^{2^t} \equiv 1 \pmod p$ but $y^{2^t} \not\equiv 1 \pmod q$. So


$$(y^{2^t} - 1, N) = p .$$
We could therefore use Euclid’s algorithm to find one of the factors of $N$.


Thus, to factorise $N$ when we have $x \in X$, we compute $(y^{2^t} - 1, N)$ for $t = 0, 1, 2, \ldots, s$. We know
that it must be $p$ (or $q$, if the roles of the two primes are reversed) for one of these choices and so we
obtain a factor of $N$.


Now we wish to count how many elements there are in $X$.
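Lemma B


For the prime $p$ and the exponent $r$ as above we have
$$|\{x \in \mathbb{Z}_p^\times : \operatorname{ord}_p(x^r) = c\}| \le \tfrac12(p-1)$$
for every possible order $c$.


Proof:
Let $a$ be a primitive root modulo $p$. So $\mathbb{Z}_p^\times$ is the cyclic group generated by $a$. Then $a^{2^s r} = a^{de-1} \equiv 1
\pmod p$ so we certainly have $\operatorname{ord}_p(a^r) \mid 2^s$. Let $\operatorname{ord}_p(a^r) = 2^t$ for some $0 < t \le s$. (Here $t > 0$ because
$r$ is odd while $p-1$ is even, so $a^r \not\equiv 1 \pmod p$.)


If $x = a^k$, then $x^r = (a^r)^k$ and so


$$\operatorname{ord}_p(x^r) = \frac{2^t}{(2^t, k)} .$$


Hence, $\operatorname{ord}_p(x^r) = 2^t$ if and only if $k$ is odd. There are $\frac12(p-1)$ such values for $x$ in $\mathbb{Z}_p^\times$. The remaining
$\frac12(p-1)$ values for $x$ have $\operatorname{ord}_p(x^r) < 2^t$. So no more than $\frac12(p-1)$ elements from $\mathbb{Z}_p^\times$ can have
$\operatorname{ord}_p(x^r) = c$.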


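Lemma C


$$|X| \ge \tfrac12(p-1)(q-1) = \tfrac12\varphi(N) .$$


Proof:


The Chinese remainder theorem (Theorem 17.1) shows that $|X|$ is
$$|\{(x, y) \in \mathbb{Z}_p^\times \times \mathbb{Z}_q^\times : \operatorname{ord}_p(x^r) \ne \operatorname{ord}_q(y^r)\}| .$$
For each $y \in \mathbb{Z}_q^\times$ we have shown in Lemma B that
$$\{x \in \mathbb{Z}_p^\times : \operatorname{ord}_p(x^r) \ne \operatorname{ord}_q(y^r)\}$$


has at least $\frac12(p-1)$ elements. Therefore $|X| \ge \frac12(p-1)(q-1) = \frac12\varphi(N)$.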


Proof of Theorem 18.1 (continued)

Choose an integer $x$ randomly from $\mathbb{Z}_N^\times$. The probability that $x \in X$ is at least $\frac12$. When $x \in X$ we
know how to find a factor of $N$. When $x \notin X$, choose another random value for $x$. After $k$ such random
choices we will have found a factor of $N$ with probability at least $1 - (\frac12)^k$.
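The proof is constructive, so it may help to see it as a short program. The following Python sketch follows Lemma A and the random sampling above; the function name `factor_from_private_key`, the trial count and the toy key in the usage note are assumptions made for illustration.

```python
import math
import random

def factor_from_private_key(N, e, d, trials=64, rng=None):
    """Try to split N = p*q given a valid RSA key pair (e, d); return (p, q) or None."""
    rng = rng or random.Random()
    # Write d*e - 1 = 2**s * r with r odd.
    r, s = d * e - 1, 0
    while r % 2 == 0:
        r //= 2
        s += 1
    for _ in range(trials):
        x = rng.randrange(2, N - 1)
        g = math.gcd(x, N)
        if g > 1:                       # x happens to share a factor with N
            return g, N // g
        y = pow(x, r, N)
        for _t in range(s + 1):         # test gcd(y^(2^t) - 1, N) for t = 0, 1, ..., s
            g = math.gcd(y - 1, N)
            if 1 < g < N:
                return g, N // g
            y = pow(y, 2, N)
    return None
```

For the toy key N = 3233 = 61 * 53, e = 17, d = 2753, a call such as `factor_from_private_key(3233, 17, 2753)` is expected to return the two prime factors after a handful of random choices, in line with the bound 1 - (1/2)^k.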


Note that the theorem shows that finding the private key is as hard as factoring $N$. There might
well be other ways to decipher a particular ciphertext without finding the private key $d$. Rivest, Shamir
and Adleman conjectured that any algorithm that allows us to decipher messages would also allow us
to factorise $N$. However, this has not been proved.


18.2 Rabin Ciphers 


The Rabin cipher is another public key cipher that relies on the difficulty of factoring products 
N = pq of two large primes. For this we need to consider finding square roots modulo a prime. 


Lemma 18.2
Let $p$ be a prime of the form $4k - 1$. If $c \equiv a^2 \pmod p$ then $a \equiv \pm c^k \pmod p$.


Proof:
If $c \equiv a^2 \pmod p$, then Fermat’s little theorem gives


$$c^{2k} \equiv a^{4k} \equiv a^{p+1} \equiv a^2 \pmod p .$$


So $c^k \equiv \pm a \pmod p$.


To make a Rabin cipher I first choose two large primes $p, q$ of the form $p = 4k - 1$ and $q = 4m - 1$.
Set $N = pq$. Then I will create a cipher for the alphabet $\mathbb{Z}_N$.


The public key will be $N$ and the encrypting function is
$$a \mapsto a^2 \pmod N .$$
(Usually we restrict the alphabet so that $(a, N) = 1$ and $a > N^{1/2}$.) The private key will be $(p, q)$.


Suppose that we have received a ciphertext $c \equiv a^2 \pmod N$ and know the private key. Then
Lemma 18.2 shows that


$$a \equiv \delta c^k \pmod p \quad \text{and} \quad a \equiv \varepsilon c^m \pmod q$$


where $\delta, \varepsilon$ are each $\pm 1$. Find integers $u, v$ with $up + vq = 1$. Then the Chinese remainder theorem
shows that


$$a \equiv \delta c^k + up(\varepsilon c^m - \delta c^k) \pmod N .$$


All four of these possible values can occur and are distinct by Proposition 17.3. To decipher $c$ we find
all four square roots modulo $N$ and choose the one that makes sense. Our messages should contain
enough redundancy for only one of the four choices to make sense.
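The deciphering step can be sketched in Python as below, combining Lemma 18.2 with the Chinese remainder formula above. The function name `rabin_square_roots` and the toy primes in the usage note are hypothetical, and `pow(p, -1, q)` assumes Python 3.8 or later.

```python
def rabin_square_roots(c, p, q):
    """All square roots of c modulo N = p*q, for primes p = 4k - 1 and q = 4m - 1."""
    N = p * q
    k, m = (p + 1) // 4, (q + 1) // 4
    root_p = pow(c, k, p)             # a = +-root_p (mod p) by Lemma 18.2
    root_q = pow(c, m, q)             # a = +-root_q (mod q)
    u = pow(p, -1, q)                 # u*p = 1 (mod q), so u*p + v*q = 1 for some v
    roots = set()
    for delta in (1, -1):
        for eps in (1, -1):
            x = (delta * root_p + u * p * (eps * root_q - delta * root_p)) % N
            roots.add(x)
    return sorted(roots)
```

With the toy primes p = 7 and q = 11, the plaintext a = 20 encrypts to c = 400 mod 77 = 15, and `rabin_square_roots(15, 7, 11)` returns four candidates, one of which is 20. The receiver still has to pick the candidate that makes sense.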


When we do not know the private key, breaking the Rabin cipher is as hard as factorising N. 


Theorem 18.3 Security of Rabin ciphers 
An algorithm to decipher the Rabin cipher gives an algorithm to factorise N. 


Proof: 
An algorithm to decipher the Rabin cipher must give one of the four square roots of a ciphertext
$c$ modulo $N$ which we obtain as $c \equiv x^2 \pmod N$.


Choose $a \in \mathbb{Z}_N^\times$ at random. This algorithm gives a particular square root $x$ of $c \equiv a^2 \pmod N$, so
$x^2 \equiv a^2 \pmod N$. There are 4 distinct choices for $a$ that give the same value for $c$. Two of these give
$x \equiv \pm a$ but the other two do not. For these other two we must have


$$x^2 - a^2 = (x+a)(x-a) \equiv 0 \pmod N$$
but neither $x + a$ nor $x - a$ is divisible by $N$. Therefore, $(N, x-a) \ne 1$.


This means that, with probability $\frac12$ we find a non-trivial factor $(N, x-a)$ of $N$. If we repeat this
$r$ times, the chance of finding a factor of $N$ is at least $1 - (\frac12)^r$.


19: DISCRETE LOGARITHM CIPHERS


In this lecture we will look at some ciphers founded on the difficulty of solving the discrete logarithm
problem. Throughout, $p$ will be a large prime number and $\gamma$ will be a primitive root modulo $p$. So all
of the non-zero elements of $\mathbb{Z}_p$ are powers of $\gamma$. We will assume that the values of $p$ and $\gamma$ are published
and known to everyone.


19.1 Diffie — Hellman Key Exchange 


First we look at how to establish a common secret key so two people can use a symmetric cipher 
to communicate securely. The method described is the Diffie — Hellman key exchange. 


Let Alice and Bob be the two people. They wish to agree on a key $k \in \mathbb{Z}_p$ to use. Alice chooses a
random $a \in \mathbb{Z}_{p-1}$; computes $A = \gamma^a$; and then publishes $A$. Similarly, Bob chooses a random $b \in \mathbb{Z}_{p-1}$;
computes $B = \gamma^b$; and publishes $B$. Then they take the key to be


$$k = B^a = A^b \pmod p .$$


Alice knows the value of $a$ and the published number $B$ so she can compute $B^a$. Bob knows $b$ and the
published number $A$ so he can compute $A^b$. However,


$$B^a = (\gamma^b)^a = (\gamma^a)^b = A^b \pmod p$$
so Alice and Bob may take the common value of $B^a$ and $A^b$ as their common key.
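A minimal Python sketch of the exchange follows. The parameters p = 23 and primitive root 5, and the secret exponents 6 and 15, are toy values assumed only for illustration; the function names are hypothetical.

```python
p, gamma = 23, 5        # published toy parameters; 5 is a primitive root modulo 23

def public_value(secret):
    """The number a participant publishes: gamma^secret (mod p)."""
    return pow(gamma, secret, p)

def shared_key(their_public, my_secret):
    """The key computed from the other party's published number and one's own secret."""
    return pow(their_public, my_secret, p)

a, b = 6, 15                            # Alice's and Bob's secret exponents
A, B = public_value(a), public_value(b)
key_alice = shared_key(B, a)            # B^a (mod p)
key_bob = shared_key(A, b)              # A^b (mod p)
```

Both `key_alice` and `key_bob` hold the same value, although neither secret exponent was transmitted.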


If an enemy can compute discrete logarithms efficiently, then he can find $a = \log_\gamma A$ from the
published values of $p$, $\gamma$ and $A$. Then he can compute the key as $B^a$. Diffie and Hellman conjectured
but did not prove that finding the value of this key from the published values $A$ and $B$ is equivalent to
the discrete logarithm problem.


Shamir showed how we can use this idea to communicate securely without any public keys. Alice
and Bob take $\mathbb{Z}_p$ as their alphabet. As in the Diffie-Hellman key exchange, Alice chooses a random
$a \in \mathbb{Z}_{p-1}$ and computes $A = \gamma^a$. Alice also finds an integer $a'$ with $aa' \equiv 1 \pmod{p-1}$ by using Euclid’s
algorithm (so $a$ must be chosen coprime to $p-1$). Alice keeps both numbers $a$ and $a'$ securely. Similarly,
Bob chooses a random $b \in \mathbb{Z}_{p-1}$ coprime to $p-1$; computes $B = \gamma^b$; and finds an integer $b'$ with
$bb' \equiv 1 \pmod{p-1}$. He keeps both $b$ and $b'$ securely.


To send a message $m \in \mathbb{Z}_p$ from Alice to Bob, Alice computes $c = m^a \pmod p$ and sends $c$ to Bob.
Then Bob computes $d = c^b \pmod p$ and sends $d$ back to Alice. Alice now computes $e = d^{a'} \pmod p$
and sends $e$ to Bob. Finally, Bob computes $e^{b'} \pmod p$. Fermat’s little theorem shows that


$$e^{b'} = d^{a'b'} = c^{ba'b'} = m^{aba'b'} \equiv m \pmod p .$$


So Bob has recovered the plaintext $m$.


             Alice                               Bob
Private      a                                   b
Private      a' with aa' = 1 (mod p-1)           b' with bb' = 1 (mod p-1)
             m
             ↓
             c = m^a (mod p)          →          c
                                                 ↓
             d                        ←          d = c^b (mod p)
             ↓
             e = d^a' (mod p)         →          e
                                                 ↓
                                                 f = e^b' (mod p)
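The three passes in the table can be traced in Python as follows. This is a sketch only: the function names and the toy prime p = 23 in the usage note are assumptions, and `pow(a, -1, p - 1)` requires Python 3.8 or later.

```python
import math

def make_exponents(a, p):
    """Return (a, a') with a*a' = 1 (mod p-1); a must be coprime to p-1."""
    if math.gcd(a, p - 1) != 1:
        raise ValueError("exponent must be coprime to p - 1")
    return a, pow(a, -1, p - 1)

def three_pass(m, p, alice, bob):
    """Trace the three transmissions and return the value f that Bob finally computes."""
    a, a_inv = alice
    b, b_inv = bob
    c = pow(m, a, p)          # Alice -> Bob
    d = pow(c, b, p)          # Bob -> Alice
    e = pow(d, a_inv, p)      # Alice -> Bob
    return pow(e, b_inv, p)   # f = e^b' (mod p), computed by Bob
```

For example, with `alice = make_exponents(3, 23)` and `bob = make_exponents(7, 23)`, the call `three_pass(10, 23, alice, bob)` returns the original message 10.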


This method has the advantage that no keys have to be published in order for Alice and Bob to 
communicate. However, it takes three times as long to transmit a message. 


19.2 The Elgamal Cipher 


Elgamal showed how to adapt the Diffie-Hellman key exchange to give a cipher.

Bob wishes to send encrypted messages to Alice. So Alice chooses a random private key $a \in \mathbb{Z}_{p-1}$;
computes $A = \gamma^a \pmod p$; and publishes $A$ as her public key. To send a message $m \in \mathbb{Z}_p$, Bob first
chooses a random number $b \in \mathbb{Z}_{p-1}$ and sends the pair


$$(c_0, c_1) = (\gamma^b, A^b m) \in \mathbb{Z}_p \times \mathbb{Z}_p .$$


Alice can decipher this by computing $c_1 c_0^{-a} \pmod p$. For we have


$$c_0^a = \gamma^{ba} = \gamma^{ab} = A^b \pmod p .$$
So $c_1 c_0^{-a} \equiv m \pmod p$.
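Here is a small Python sketch of the Elgamal cipher as just described. The parameters p = 23 and primitive root 5 are toy values assumed for illustration, the function names are hypothetical, and the inverse power c0^(-a) is computed as c0^(p-1-a) using Fermat's little theorem.

```python
import random

p, gamma = 23, 5        # published toy parameters; 5 is a primitive root modulo 23

def keygen(rng):
    """Alice's key pair (a, A) with A = gamma^a (mod p)."""
    a = rng.randrange(1, p - 1)
    return a, pow(gamma, a, p)

def encrypt(m, A, rng):
    """Bob's ciphertext (c0, c1) = (gamma^b, A^b * m) for a fresh random b."""
    b = rng.randrange(1, p - 1)
    return pow(gamma, b, p), pow(A, b, p) * m % p

def decrypt(c0, c1, a):
    """Alice recovers m = c1 * c0^(-a) (mod p)."""
    return c1 * pow(c0, p - 1 - a, p) % p
```

Because `b` is chosen afresh for each message, encrypting the same plaintext twice will usually give different ciphertexts, both of which decrypt correctly.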

If an enemy knows one plaintext $m$ and the corresponding ciphertext $(c_0, c_1)$, then he can find
the two public keys $A = \gamma^a \pmod p$ and $B = \gamma^b \pmod p$ used in the Diffie-Hellman key exchange.


Breaking the Elgamal cipher is equivalent to breaking the Diffie-Hellman key exchange. We hope that
both are as hard as computing discrete logarithms.


20: SIGNATURES


So far we have been concerned with how to send messages securely from one person to another. By 
encrypting the message we try to ensure that, if the message is intercepted, no-one except the intended 
recipient can decipher it. However, there are many other security concerns. When we receive a message 
we might want to know who sent it to protect us from forgeries. We might need to know that it has not 
been altered. We might need to prove that it came from a particular person. For all of these we can 
adapt the ciphers we have studied to give us the required reassurance. 


20.1 Guarantees 


Bob sends Alice a message. They may both be concerned about: 


Secrecy 
No third party can read their message. The ciphers we described in the past two lectures achieve this. 


Integrity 

No third party has altered the message. For example, if Bob is the customer of the bank owned by 
Alice, and sends an encrypted instruction to pay £100 to Charles, then Alice needs to be sure that the 
amount and recipient have not been altered. 


The ciphers we have used do not automatically achieve this. For example, suppose that the message
to the bank consisted of two numbers $(a, m)$, the first specifying the recipient and the second being the
number of pounds to pay them. These are encrypted using the RSA algorithm as $(c_0, c_1) = (a^e, m^e)$. If
this message is intercepted, it may be impossible to decipher but it can still be sent again requesting
further payments. Even more worryingly, the message could be altered to $(c_0, c_1^3)$, which deciphers
to $(a, m^3)$, and then sent. This last is called a homomorphism attack. It uses knowledge of the form
of the cipher.


Authenticity 

Alice can be sure that Bob sent the message. The usual signatures at the bottom of letters are intended 
to ensure that the recipient knows who sent it. We want to develop similar signatures for encrypted 
messages. 


For example, if an enemy intercepts messages that end with a signature identifying the sender, he 
can swap the signatures to create chaos even if he can not understand the messages. We would like a 
signature that guarantees that the particular message comes from an identified sender. 


Non-repudiation 

Alice can prove that Bob sent a particular message. This is the role played by signatures on legal 
documents. A court can be convinced that a particular person signed the document and so accepted its 
contents. If electronic communications are to be used in law they require a similar guarantee. 


All of these problems are interrelated and we may well require a combination of the guarantees 
described above.


20.2 Hash Functions


Suppose that we have a function $h : A^* \to B$ from all messages using the alphabet $A$ into another
alphabet $B$. There are infinitely many messages but $B$ is finite, so there are necessarily many messages
with the same $h$-value. Nonetheless, it should be hard to find two messages with the same $h$-value,
a collision. If it is simple to compute $h(m)$ for any message $m$ but hard to find a collision, then $h$ is
called a hash function.


Hash functions are similar to ciphers but they compress the information. They reduce any message 
to a letter in a fixed alphabet. If the message is changed then it is very likely that the hash value will 
also change. So we can use the hash value to test the integrity of a message. In this sense it acts as a 
digest or summary of the message. 


Example: 

The md5sum of a computer file is a 128 bit binary sequence computed for a file. The value is usually 
represented as a hexadecimal string of length 32. For instance the md5sum of a text file containing the 
sentence “The quick brown fox jumped over the lazy dog’s back.” is 


ed6761a9a0d26cbe6c7a9666e07bf08d . 


Changes to the file will, with very high probability, change the md5sum. For example, if we change the
capital letter to a small letter we get:


9c9d6b537e1ccc8623ee953bb442d147 
which is very different. 


The md5sum is used to check the integrity of computer files. If I download a large file over the
internet, it may be corrupted either accidentally or maliciously. So I compute its md5sum and compare
it with the md5sum published by the provider of the file.
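The same check can be made from Python with the standard `hashlib` module. This is a small sketch with a hypothetical helper name `md5_hex`; note that the digest depends on the exact bytes of the file (including any trailing newline), so it need not reproduce the two values quoted above.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string of length 32."""
    return hashlib.md5(data).hexdigest()

original = md5_hex(b"The quick brown fox jumped over the lazy dog's back.\n")
altered = md5_hex(b"the quick brown fox jumped over the lazy dog's back.\n")
```

Comparing `original` with `altered` shows how a one-letter change gives an unrelated digest. To check a download, one would compute `md5_hex` on the file's contents and compare the result with the value published by the provider.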


It is often simple to produce hash functions from ciphers. Consider, for example, the RSA cipher
with public key $(N, e)$ and private key $d$. For any letter $a$ I can compute $a^e \pmod N$ and then apply a
permutation $\sigma$ to the binary bits of $a^e$ to give $\sigma(a^e)$. For a message $m = a_1 a_2 a_3 \ldots a_K$ I compute $\sigma_k(a_k^e)$
for a sequence of different permutations $\sigma_k$, and then add the results modulo $N$. Changes to letters
have an effect on the hash value that is difficult to predict so it appears to be difficult to find collisions.


In practice hash functions tend to be produced by less algebraic means. A message is first broken
up into blocks of fixed length $M$, say $B_1, B_2, B_3, \ldots, B_K$, padding the last block if necessary. Then we
choose a compression function (or a sequence of compression functions) $C : A^M \times A^M \to A^M$. Starting
with a fixed initial block $J$, we compute successively:


$$C_1 = J, \quad C_2 = C(C_1, B_1), \quad C_3 = C(C_2, B_2), \quad \ldots, \quad C_{n+1} = C(C_n, B_n), \quad \ldots, \quad C_{K+1} = C(C_K, B_K) .$$
The final value $C_{K+1}$ when we have processed the entire message is the hash value.
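The iteration above can be sketched in Python. Everything specific here is an assumption made for illustration: the block length M = 8 bytes, the zero padding, and in particular the toy function `compress`, which mixes its inputs arithmetically and is certainly not collision resistant.

```python
M = 8   # block length in bytes (hypothetical choice)

def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function taking two M-byte blocks to one; NOT secure."""
    s = int.from_bytes(state, "big")
    b = int.from_bytes(block, "big")
    mixed = ((s ^ b) * 0x9E3779B97F4A7C15 + (s >> 7) + 1) % (1 << (8 * M))
    return mixed.to_bytes(M, "big")

def iterated_hash(message: bytes, initial: bytes = b"\x00" * M) -> bytes:
    """Compute C_{K+1} by feeding the blocks B_1, ..., B_K through the compression function."""
    # Pad the last block with zero bytes; a real design would also encode the length.
    if len(message) % M:
        message += b"\x00" * (M - len(message) % M)
    state = initial                                  # C_1 = J
    for i in range(0, len(message), M):
        state = compress(state, message[i:i + M])    # C_{n+1} = C(C_n, B_n)
    return state
```

The naive zero padding already produces collisions: `b"abc"` and `b"abc\x00"` are padded to the same block and so hash to the same value. This is one reason practical designs include the message length in the padding.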


We will see that hash functions provide a convenient way to guarantee messages.


20.3 RSA Signatures


Bob wishes to sign a message to show that he sent it. To do this he can use his public key cipher.
Suppose that $e_B$ is his encrypting function, which is published for everyone, and $d_B$ his decrypting
function, which he keeps private. For simplicity we will suppose that we are dealing with a cipher where
the encrypting function and the decrypting function are inverses of each other. So $e_B \circ d_B = I$ as well
as $d_B \circ e_B = I$. This is true, for example, for the RSA cipher.


For a simple signature, Bob just takes his name $n$, computes $s = d_B(n)$ using his private decrypting
function, and appends $s$ to his message. Anyone receiving this signature can calculate $e_B(s)$ using Bob’s
public encrypting function and this is Bob’s name $e_B(d_B(n)) = n$. No one else has access to Bob’s private
decrypting function, so they would find it very hard to sign the message in this way. Note, however,
that there is nothing to prevent an enemy from separating Bob’s signature from a message he intercepts
and attaching it to others spuriously. To avoid this we need a signature that guarantees the integrity
of the message.


Bob wishes to send the message $m$. He adds to this a signature $s = d_B(h(m))$ where $h$ is a publicly
known hash function. So he sends $(m, s)$.


When Alice receives a message $(m, s)$ she can confirm that Bob sent it by using his public key to
compute $e_B(s)$. Then she checks that $h(m) = e_B(s)$. Only Bob has the decrypting function $d_B$ so only
he could have signed the message with a signature satisfying $e_B(s) = h(m)$. Alice can also be confident
that the message has not been altered since any alteration to $m$ would require a corresponding alteration
to $d_B(h(m))$.
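A hash-then-sign scheme of this kind can be sketched with RSA in Python. All the concrete choices are assumptions for illustration only: the modulus is the product of the two Mersenne primes 2^61 - 1 and 2^89 - 1 (far too small and too special for real use), the exponent is 65537, and the hash h is SHA-256 reduced modulo N. The call `pow(E, -1, ...)` requires Python 3.8 or later.

```python
import hashlib

P, Q = 2**61 - 1, 2**89 - 1             # toy primes; Bob keeps these private
N, E = P * Q, 65537                     # Bob's public key
D = pow(E, -1, (P - 1) * (Q - 1))       # Bob's private decrypting exponent

def h(message: bytes) -> int:
    """Publicly known hash function with values in the integers modulo N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """Bob's signature s = d_B(h(m))."""
    return pow(h(message), D, N)

def verify(message: bytes, s: int) -> bool:
    """Alice's check that e_B(s) = h(m), using only the public key."""
    return pow(s, E, N) == h(message)
```

For example, after `s = sign(b"pay 100 to Charles")`, the check `verify(b"pay 100 to Charles", s)` succeeds, while an altered message presented with the same signature is rejected unless the two hash values happen to agree.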


Bob’s signature can also be used to prove to a third party that the message was indeed from Bob.
For, if we believe that the cipher we are using is secure, then no one except Bob could have sent a
signed message $(m, s)$ with $e_B(s) = h(m)$ without knowing Bob’s private decrypting function $d_B$. Note,
however, that Bob can invalidate this by telling others his private key and so allowing them to forge his
signature. Furthermore, Alice can not alter the message $m$ that she received without discovering how
to make the corresponding change to $d_B(h(m))$.


The message in this example is not encrypted but, if it needs to be, we could use Alice’s public key
and send $(e_A(m), d_B(e_A(m)))$.


The hash function plays two important roles in this signature scheme. It reduces the length of
the signature but it also makes it hard to forge. Suppose, for example, that Bob simply signed the
message $m$ with $s = d_B(m)$ and sent $(m, d_B(m))$. It is still possible for Alice to verify that the message
is unaltered and came from Bob. However, any enemy can send $(e_B(c), c)$ using Bob’s public key and
these also appear to be correctly signed by Bob. However, the enemy has no way of knowing what the
“message” $e_B(c)$ says or, indeed, if it makes any sense. (Such forged messages are known as existential
forgeries.) The hash function defeats this sort of forgery.


We also need to be confident that the message was signed and sent at the correct time, so we should 
insist that all messages are time stamped. This prevents messages being resent as if they were new. 


20.4 The Elgamal Signature Scheme


The Elgamal cipher can also be used to sign messages. Let $p$ be a large prime and $\gamma$ a primitive
root modulo $p$. Bob chooses $b \in \mathbb{Z}_{p-1}$ and publishes his public key $B = \gamma^b \in \mathbb{Z}_p^\times$. Let $h$ be a hash
function that takes values in $\mathbb{Z}_{p-1}$. To send a message $m$ Bob chooses a random exponent $k$ coprime to
$p-1$ and puts


$$r = \gamma^k \in \mathbb{Z}_p^\times , \qquad s = \frac{h(m) - br}{k} \pmod{p-1} .$$


(Since $(k, p-1) = 1$, we know that $k$ is invertible modulo $p-1$.) Bob’s signature is $(r, s)$ so he sends
the message $(m, r, s)$ to Alice.


When Alice receives a triple $(m, r, s)$ she checks whether $\gamma^{h(m)} \equiv B^r r^s \pmod p$. If the triple is the
message $(m, r, s)$ from Bob then we have


$$\gamma^{h(m)} = \gamma^{br + ks} = (\gamma^b)^r (\gamma^k)^s = B^r r^s \pmod p$$


so Alice’s check succeeds. It is believed that the only way to forge this signature is to solve the discrete
logarithm problem.
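The signing and checking steps can be sketched in Python as follows. The parameters p = 23 and primitive root 5 are toy values, the hash h is assumed to be SHA-256 reduced modulo p - 1, and the function names are hypothetical; `pow(k, -1, p - 1)` requires Python 3.8 or later.

```python
import hashlib
import math
import random

p, gamma = 23, 5        # published toy parameters; 5 is a primitive root modulo 23

def h(message: bytes) -> int:
    """Hash function with values in the integers modulo p - 1 (assumed for illustration)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % (p - 1)

def sign(message: bytes, b: int, rng) -> tuple:
    """Bob's signature (r, s), using a fresh random k coprime to p - 1."""
    while True:
        k = rng.randrange(1, p - 1)
        if math.gcd(k, p - 1) == 1:
            break
    r = pow(gamma, k, p)
    s = (h(message) - b * r) * pow(k, -1, p - 1) % (p - 1)
    return r, s

def verify(message: bytes, r: int, s: int, B: int) -> bool:
    """Alice's check that gamma^h(m) = B^r * r^s (mod p)."""
    return pow(gamma, h(message), p) == pow(B, r, p) * pow(r, s, p) % p
```

With Bob's private key b = 6 and public key B = 5^6 mod 23 = 8, a signature produced by `sign(message, 6, random.Random())` passes `verify(message, r, s, 8)`. A new `k` is drawn on every call to `sign`.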


In the Elgamal signature scheme it is vital that a different random exponent $k$ is chosen for each
message. For suppose that two messages $m_1, m_2$ are signed using the same value for $k$, giving $(m_1, r, s_1)$
and $(m_2, r, s_2)$. Then

$$h(m_1) - h(m_2) \equiv k(s_1 - s_2) \pmod{p-1} .$$


Recall that $xs \equiv h \pmod m$ has either no solutions for $x$ or else $(m, s)$ solutions modulo $m$. Hence,
there are $(p-1, s_1 - s_2)$ solutions for $k$ modulo $p-1$. Choose the one that gives the correct value for
$r \equiv \gamma^k \pmod p$. Then $s_1 \equiv \dfrac{h(m_1) - br}{k} \pmod{p-1}$ implies that

$$br \equiv h(m_1) - k s_1 \pmod{p-1} .$$


This gives $(p-1, r)$ solutions for $b$. Choose the one that gives $B \equiv \gamma^b \pmod p$. In this way we have
found Bob’s private key $b$ as well as the exponent $k$ that he is using for signatures.


21: BIT COMMITMENT


Alice and Bob agree to decide some matter by tossing a coin. Alice tosses the coin and Bob calls.
Suppose, however, that they are in different rooms and so Bob can not see Alice tossing the coin. If
they do not trust one another then they need a way to ensure that Bob does not change his call when
he hears the result, nor Alice change the reported result when she knows Bob’s guess.


One way to do this is for Bob to write down his guess and put it in a sealed envelope, which he 
gives to Alice. Then Alice tosses the coin and reports the result. Together they open the envelope and 
see if Bob’s guess was correct. This is called a commit and reveal process. Bob commits his guess to 
paper and then, after Alice has announced the result, he reveals it to Alice. 


Similar problems arise when sending messages. For example, someone selling racing tips
needs to ensure that he is paid before revealing the tip. Or a poll may be conducted online where
the results will be revealed to all participants but only after everyone has voted. We can use a similar
commit and reveal process. It is usually called bit commitment. Bob chooses a bit 0 or 1 and commits
this to a message sent to Alice. However, the message is sent in such a way that Alice can not read it
without further information and yet Bob can not alter his choice. There are a variety of methods for
achieving this.


21.1 Bit Commitment using a Public Key Cipher 


Let $e_B$ and $d_B$ be the encrypting and decrypting functions for a public key cipher used by Bob.
The encrypting function $e_B$ is published but the decrypting function $d_B$ is kept secret by Bob. Now
Bob makes his choice $m$ from some large alphabet $A$ to indicate his guess, and commits to Alice the
encrypted message $c = e_B(m)$. Provided that the cipher is secure, Alice can not decipher this. When
the time comes to reveal his choice, Bob sends to Alice his private key so that she can find the decrypting
function $d_B$. Now she can compute


$$d_B(c) = d_B(e_B(m)) = m$$


and find out what Bob’s choice was. She can also check that $d_B$ and $e_B$ are inverse functions so she can
be confident that Bob has not sent the wrong private key so as to pretend his choice was different.


Note that Bob should not simply use $m = 1$ for a head and $m = 0$ for a tail. If he did this, Alice
could simply calculate $e_B(0)$ and $e_B(1)$ and see which Bob had sent. This is a dictionary attack, where
Alice can simply find all possible messages and compute the corresponding ciphertexts using the public
key $e_B$. Instead, Bob should introduce some randomness. For example, he can pad the message 0 or
1 by following it with a large number of random bits. Then there are a very large number of possible
messages and an equally large number of possible ciphertexts for Alice to decipher.


21.2 Bit Commitment using a Noisy Channel 


We can also use our earlier work on noisy channels to give an alternative method for bit commitment.
Suppose that Alice and Bob have two ways to communicate. They can use a clear channel or a noisy
channel. The clear channel transmits without errors but the noisy channel is a binary symmetric channel
with error probability $0 < p < \frac12$. We assume that the noisy channel corrupts bits independently of any
action by Alice or Bob, so neither can affect its behaviour.


Now Alice chooses a linear binary code with codebook $C \subset \mathbb{F}_2^N$ and minimum distance $d$. Bob
chooses a random, non-trivial, linear map $\theta : C \to \mathbb{F}_2$. Both publish their choices. In order to send a
bit $m \in \mathbb{F}_2$, Bob chooses a random code word $c \in C$ with $\theta(c) = m$ and sends this to Alice via the noisy
channel. So Alice receives a vector $r = c + e$ in which certain components of $c$ have been altered.
The expected value for $d(e, 0)$ is $Np$ and $N$ is chosen so that this is very much larger than $d$. This
means that Alice can not tell what the original codeword $c$ was and hence can not find $\theta(c) = m$ and
determine Bob’s choice.


When the time comes for Bob to reveal his choice, he sends the codeword $c$ to Alice via the clear
channel. Alice can now check that $d(c, r)$ is close to the expected value $Np$. If it is, she accepts that
$\theta(c)$ was Bob’s choice. If not, then she rejects it. There is, of course, a small chance that many more or
many fewer bits of $c$ were corrupted by the noisy channel and so Alice rejects Bob’s choice even though
he has not cheated. We choose the parameters $N, d$ so that the probability of this occurring is very
small. If it does occur, Alice and Bob should just repeat the entire process.


We have seen that Alice can not read Bob’s guess until he has revealed it. We also want to show
that Bob can not cheat by changing his guess and sending a different code word $c' \ne c$ to Alice via the
clear channel. Bob knows the code word $c$ that he originally chose but he does not know how it was
corrupted by the noisy channel. So, all he knows is that the vector $r$ received by Alice is at a Hamming
distance of about $Np$ from $c$. If he sends $c'$, then he must ensure that $d(r, c') \approx Np$. The probability
that this happens is small unless he chooses $c'$ very close to $c$. However, any two codewords are at least
distance $d$ apart, so we see that Bob can not cheat.


Now let us prove these results more carefully. First fix a small number $q$ that is an acceptable
probability of error. We will choose the linear code at random by specifying its syndrome matrix. Let $S$
be an $(N-K) \times N$ matrix over $\mathbb{F}_2$ with each column chosen independently and uniformly from $\mathbb{F}_2^{N-K}$.
There are $(2^{N-K})^N = 2^{(N-K)N}$ choices for $S$. Let $C$ be the linear code book given by


$$C = \{x \in \mathbb{F}_2^N : Sx = 0\} .$$
Consider a non-zero vector $c \in \mathbb{F}_2^N$ with weight $d(c, 0) = w$. Then $c$ is in the code $C$ when the $w$ columns
of $S$ corresponding to the non-zero entries of $c$ sum to 0. Hence there are $(2^{N-K})^{N-1} = 2^{(N-K)(N-1)}$
such matrices $S$. The number of vectors $c$ with $d(c, 0) < d$ is $V(N, d) = \sum_{j<d} \binom{N}{j}$. So the probability
that our random code has minimum distance at least $d$ exceeds


$$\frac{2^{(N-K)N} - V(N, d)\, 2^{(N-K)(N-1)}}{2^{(N-K)N}} = 1 - \frac{V(N, d)}{2^{N-K}} .$$


Proposition 7.8 shows that we can find a linear code with minimum distance $d$ provided that


$$h(d/N) < 1 - \frac{K}{N} .$$


This means that we can find codes with minimum distance at least $\delta N$ for $0 < \delta < \frac12$.


The code word $c$ was sent by Bob but $r = c + e$ was received. Here the entries $(e_i)$ of $e$ are
independent random variables each taking the value 1 with probability $p$ and 0 with probability $1 - p$.
So


$$d(c, r) = d(e, 0) = \sum_{i=1}^{N} e_i \sim \mathrm{Ber}(N, p)$$
is a sum of $N$ independent Bernoulli random variables, with mean $Np$ and variance $Np(1-p)$. Chebyshev’s
inequality gives


$$P(|d(c, r) - Np| \ge N\varepsilon) \le \frac{p(1-p)}{N\varepsilon^2} .$$


We will choose $N$ and $\varepsilon$ so that this is less than $q$. Hence
$$P(|d(c, r) - Np| \ge N\varepsilon) < q .$$


Alice will accept Bob’s choice provided that $|d(c, r) - Np| < N\varepsilon$. So the probability of Alice incorrectly
rejecting the choice is at most $q$.


Suppose that Bob tried to send another code word $c'$ in place of $c$ with $d(c, c') = u$. Let
$$U = \{i : c_i \ne c'_i\} \quad \text{and} \quad V = \{i : c_i = c'_i\}$$


so that
$$|U| = u \quad \text{and} \quad |V| = N - u .$$


Then we have $r - c' = e + (c - c')$ so


$$d(c', r) = \sum_{i \in U} (1 - e_i) + \sum_{i \in V} e_i \sim \mathrm{Ber}(u, 1-p) + \mathrm{Ber}(N-u, p) .$$


This is a random variable with mean $u(1-p) + (N-u)p = Np + u(1-2p)$ and variance
$u(1-p)p + (N-u)p(1-p) = Np(1-p)$. Hence we have


$$P(|d(c', r) - Np - u(1-2p)| \ge N\varepsilon) < q . \qquad (*)$$
Choose $\varepsilon$ so that
$$\frac{2\varepsilon}{1-2p} < \delta$$
and then choose $N$ so large that
$$\frac{p(1-p)}{N\varepsilon^2} < q .$$


Alice will only accept the code word $c'$ if $|d(c', r) - Np| < N\varepsilon$. Inequality (*) shows that the probability
this happens is less than $q$ unless
$$u(1-2p) < 2N\varepsilon .$$


Hence we must have
$$d(c, c') = u < \frac{2N\varepsilon}{1-2p} < \delta N .$$


Since the code words $c$ and $c'$ are different, we must have
$$d(c, c') \ge \delta N$$


so we have a contradiction. This shows that Alice will correctly accept Bob’s choice and Bob will be
unable to cheat except with a very small probability $q$.


21.3 Semantic Security 


We have shown how to produce ciphers that are secure in the sense that an enemy who knows
the ciphertext cannot find the corresponding plaintext in polynomial time. However, we have not
considered whether the enemy can, nonetheless, obtain some partial information about the plaintext.
When this too is impossible, then the cipher is semantically secure. This means, in particular, that it
is impossible to recover even a single bit of the plaintext from the ciphertext.


Semantically secure ciphers have been produced in recent years by encrypting the plaintext
randomly. So, rather than replacing a single letter $a \in A$ by a fixed codeword $c(a)$, we choose from a large
set of possible codewords for $a$. This makes it very much harder to find $a$ and hence to break the cipher.


Goldwasser and Micali gave an example of such a scheme that uses quadratic residues. Recall that
a number $k$ with $(k,N) = 1$ is a quadratic residue modulo $N$ if $k \equiv x^2 \pmod{N}$ for some $x$. It is
a quadratic non-residue modulo $N$ when there is no such $x$. For an odd prime $p$, we know that there is a
primitive root $\gamma$ modulo $p$, so $k$ is a quadratic residue modulo $p$ when $k \equiv \gamma^{2r} \pmod{p}$ for some
integer $r$. This gives Euler's criterion: $k$ is a quadratic residue modulo the prime $p$ if and only if
$k^{(p-1)/2} \equiv 1 \pmod{p}$.
Now consider $N = pq$, the product of two distinct primes $p$ and $q$. The Chinese remainder theorem
shows that $k$ is a quadratic residue modulo $N$ if and only if it is a quadratic residue modulo $p$ and
modulo $q$. This means that it is simple to determine whether $k$ is a quadratic residue provided that we
know how to factorize $N$.
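
The test described above is short enough to write out. The following Python sketch assumes that $p$ and $q$ are distinct odd primes and that $k$ is coprime to both; the function names are chosen here for illustration only.

```python
def is_qr_mod_prime(k, p):
    """Euler's criterion: k is a quadratic residue modulo the odd prime p
    exactly when k^((p-1)/2) is congruent to 1 modulo p (k coprime to p)."""
    return pow(k % p, (p - 1) // 2, p) == 1


def is_qr_mod_pq(k, p, q):
    """k is a quadratic residue modulo N = pq exactly when it is a
    quadratic residue modulo p and modulo q (Chinese remainder theorem)."""
    return is_qr_mod_prime(k, p) and is_qr_mod_prime(k, q)


# Example with N = 7 * 11 = 77.
print(is_qr_mod_pq(4, 7, 11))  # True: 4 = 2^2
print(is_qr_mod_pq(2, 7, 11))  # False: 2 is a residue mod 7 but not mod 11
```

Both functions need the factors $p$ and $q$; nothing here helps someone who knows only $N$.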


Bob wishes to send to Alice a binary message $m = a_1 a_2 \ldots a_K$ where each $a_i \in \mathbb{F}_2$. First Alice
needs to set a key. She selects two different large primes $p$, $q$ and sets $N = pq$. Then she chooses a
random integer $y$ modulo $N$ with $y$ a quadratic non-residue modulo $N$. It is easy to find such a $y$ since
Alice knows the factors of $N$. She publishes $(N, y)$ as her public key.


Now Bob encrypts his message $m$ as $e(a_1)e(a_2)\ldots e(a_K)$ where

$$e(a_i) = \begin{cases} x_i^2 \pmod{N} & \text{when } a_i = 0 \\ y x_i^2 \pmod{N} & \text{when } a_i = 1 . \end{cases}$$


Here $x_i$ is coprime to $N$ and is chosen randomly for each letter $a_i$. Note that the ciphertext
is very much longer than the plaintext, since each bit of $m$ is encrypted as a string of about $\log_2 N$
bits.


To decrypt the ciphertext, we need to decide if each letter $c_i = e(a_i)$ is a quadratic residue modulo
$N$, in which case $a_i = 0$, or a quadratic non-residue, in which case $a_i = 1$. This is simple for Alice since
she knows the factors $p$ and $q$ of $N$. Using Euler's criterion she can determine whether $c_i$ is a quadratic
residue or non-residue modulo $p$ and modulo $q$. Then $c_i$ is a quadratic residue modulo $N$ only if it is a
quadratic residue modulo both $p$ and $q$.
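
The key generation, encryption and decryption steps just described can be put together in a small Python sketch. It is a toy illustration only: the primes 499 and 547 are far too small to be secure, Python's `random` module is not a cryptographic random source, and all function names are assumptions made here. The sketch also picks $y$ to be a non-residue modulo both $p$ and $q$, which in particular makes it a non-residue modulo $N$ as required above.

```python
import math
import random


def is_qr_mod_prime(k, p):
    """Euler's criterion for an odd prime p and k coprime to p."""
    return pow(k % p, (p - 1) // 2, p) == 1


def make_key(p, q):
    """Return (public_key, private_key) = ((N, y), (p, q)) for odd primes p != q."""
    N = p * q
    while True:
        y = random.randrange(2, N)
        # Assumption of this sketch: y is a non-residue modulo both primes.
        if (math.gcd(y, N) == 1
                and not is_qr_mod_prime(y, p)
                and not is_qr_mod_prime(y, q)):
            return (N, y), (p, q)


def encrypt_bit(bit, public_key):
    """Encrypt one bit as x^2 (bit 0) or y*x^2 (bit 1) modulo N, with random x."""
    N, y = public_key
    while True:
        x = random.randrange(1, N)
        if math.gcd(x, N) == 1:
            break
    square = (x * x) % N
    return (y * square) % N if bit else square


def encrypt(bits, public_key):
    return [encrypt_bit(b, public_key) for b in bits]


def decrypt(ciphertext, private_key):
    """A letter decodes to 0 when it is a quadratic residue modulo p and q, else to 1."""
    p, q = private_key
    return [0 if is_qr_mod_prime(c, p) and is_qr_mod_prime(c, q) else 1
            for c in ciphertext]


public_key, private_key = make_key(499, 547)  # toy primes, for illustration only
message = [1, 0, 1, 1, 0, 0, 1]
ciphertext = encrypt(message, public_key)
print(decrypt(ciphertext, private_key) == message)  # True
```

Encrypting the same message twice will almost always give different ciphertexts, because a fresh $x_i$ is drawn for every bit.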


If an enemy tries to decipher the ciphertext without knowing the factors p and q, then the problem 
becomes very hard. It is thought that this is as hard as finding the factors. Goldwasser and Micali proved 
that this scheme is semantically secure against an enemy who has polynomially bounded resources, at 
least provided that there is no polynomial time algorithm for determining whether a number is a 
quadratic residue modulo N. 


Random encryption like the Goldwasser-Micali scheme gives greatly improved security but also
greatly expands the length of the ciphertexts.